Second Edition
1.1 Heterogeneous Parallel Computing
1.2 Architecture of a Modern GPU
1.3 Why More Speed or Parallelism?
1.4 Speeding Up Real Applications
1.5 Parallel Programming Languages and Models
Chapter 2. History of GPU Computing
2.1 Evolution of Graphics Pipelines
2.2 GPGPU: An Intermediate Step
References and Further Reading
Chapter 3. Introduction to Data Parallelism and CUDA C
3.4 Device Global Memory and Data Transfer
3.5 Kernel Functions and Threading
Chapter 4. Data-Parallel Execution Model
4.2 Mapping Threads to Multidimensional Data
4.3 Matrix-Matrix Multiplication—A More Complex Kernel
4.4 Synchronization and Transparent Scalability
4.5 Assigning Resources to Blocks
4.6 Querying Device Properties
4.7 Thread Scheduling and Latency Tolerance
5.1 Importance of Memory Access Efficiency
5.3 A Strategy for Reducing Global Memory Traffic
5.4 A Tiled Matrix–Matrix Multiplication Kernel
5.5 Memory as a Limiting Factor to Parallelism
Chapter 6. Performance Considerations
6.1 Warps and Thread Execution
6.3 Dynamic Partitioning of Execution Resources
6.4 Instruction Mix and Thread Granularity
Chapter 7. Floating-Point Considerations
7.3 Special Bit Patterns and Precision in IEEE Format
7.4 Arithmetic Accuracy and Rounding
Chapter 8. Parallel Patterns: Convolution: With an Introduction to Constant Memory and Caches
8.2 1D Parallel Convolution—A Basic Algorithm
8.3 Constant Memory and Caching
8.4 Tiled 1D Convolution with Halo Elements
8.5 A Simpler Tiled 1D Convolution—General Caching
Chapter 9. Parallel Patterns: Prefix Sum: An Introduction to Work Efficiency in Parallel Algorithms
9.3 Work Efficiency Considerations
9.4 A Work-Efficient Parallel Scan
9.5 Parallel Scan for Arbitrary-Length Inputs
Chapter 10. Parallel Patterns: Sparse Matrix-Vector Multiplication: An Introduction to Compaction and Regularization in Parallel Algorithms
10.3 Padding and Transposition
10.4 Using Hybrid to Control Padding
10.5 Sorting and Partitioning for Regularization
Chapter 11. Application Case Study: Advanced MRI Reconstruction
Chapter 12. Application Case Study: Molecular Visualization and Analysis
12.2 A Simple Kernel Implementation
12.3 Thread Granularity Adjustment
Chapter 13. Parallel Programming and Computational Thinking
13.1 Goals of Parallel Computing
Chapter 14. An Introduction to OpenCL™
14.5 Device Management and Kernel Launch
14.6 Electrostatic Potential Map in OpenCL
Chapter 15. Parallel Programming with OpenACC
15.5 Future Directions of OpenACC
Chapter 16. Thrust: A Productivity-Oriented Library for CUDA
17.1 CUDA FORTRAN and CUDA C Differences
17.2 A First CUDA FORTRAN Program
17.3 Multidimensional Array in CUDA FORTRAN
17.4 Overloading Host/Device Routines With Generic Interfaces
17.5 Calling CUDA C Via Iso_C_Binding
17.6 Kernel Loop Directives and Reduction Operations
17.8 Asynchronous Data Transfers
17.9 Compilation and Profiling
17.10 Calling Thrust from CUDA FORTRAN
Chapter 18. An Introduction to C++ AMP
18.2 Details of the C++ AMP Execution Model
18.5 C++ AMP Graphics Features
Chapter 19. Programming a Heterogeneous Computing Cluster
19.4 MPI Point-to-Point Communication Types
19.5 Overlapping Computation and Communication
19.6 MPI Collective Communication
Chapter 20. CUDA Dynamic Parallelism
20.2 Dynamic Parallelism Overview
Chapter 21. Conclusion and Future Outlook
21.3 Kernel Execution Control Evolution
Appendix A. Matrix Multiplication Host-Only Version Source Code
Appendix B. GPU Compute Capabilities
B.1 GPU Compute Capability Tables
Editorial Project Manager: Nathaniel McFadden
Project Manager: Priya Kumaraguruparan
Designer: Alan Studholme
Morgan Kaufmann is an imprint of Elsevier
225 Wyman Street, Waltham, MA, 02451, USA
© 2013, 2010 David B. Kirk/NVIDIA Corporation and Wen-mei Hwu. Published by Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods or professional practices, may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information or methods described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
Library of Congress Cataloging-in-Publication Data
Application submitted
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
ISBN: 978-0-12-415992-1
Printed in the United States of America
13 14 15 16 17 10 9 8 7 6 5 4 3 2 1
For information on all MK publications visit our website at www.mkp.com
We are proud to introduce the second edition of Programming Massively Parallel Processors: A Hands-on Approach. Mass-market computing systems that combine multicore central processing units (CPUs) and many-thread GPUs have brought terascale computing to laptops and petascale computing to clusters. Armed with such computing power, we are at the dawn of pervasive use of computational experiments for science, engineering, health, and business disciplines. Many will be able to achieve breakthroughs in their disciplines using computational experiments that are of an unprecedented level of scale, accuracy, controllability, and observability. This book provides a critical ingredient for the vision: teaching parallel programming to millions of graduate and undergraduate students so that computational thinking and parallel programming skills will be as pervasive as calculus.
Since the first edition came out in 2010, we have received numerous comments from our readers and instructors. Many told us about the existing features they value. Others gave us ideas about how we should expand its contents to make the book even more valuable. Furthermore, the hardware and software technology for heterogeneous parallel computing has advanced tremendously. In the hardware arena, two more generations of graphics processing unit (GPU) computing architectures, Fermi and Kepler, have been introduced since the first edition. In the software domain, CUDA 4.0 and CUDA 5.0 have allowed programmers to access the new hardware features of Fermi and Kepler. Accordingly, we added eight new chapters and completely rewrote five existing chapters.
Broadly speaking, we aim for three major improvements in the second edition while preserving the most valued features of the first edition. The first improvement is to introduce parallel programming in a more systematic way. This is done by (1) adding new Chapters 8, 9, and 10 that introduce frequently used, basic parallel algorithm patterns; (2) adding more background material to Chapters 3, 4, 5, and 6; and (3) adding a treatment of numerical stability to Chapter 7. These additions are designed to remove the assumption that students are already familiar with basic parallel programming concepts. They also help to address our readers' desire for more examples.
The second improvement is to cover practical techniques for using joint MPI-CUDA programming in a heterogeneous computing cluster. This has been a frequently requested addition by our readers. Due to the cost-effectiveness and high throughput per watt of GPUs, many high-performance computing systems now provision GPUs in each node. The new Chapter 19 explains the conceptual framework behind the programming interfaces of these systems.
The third improvement is an introduction of new parallel programming interfaces and tools that can significantly improve the productivity of data-parallel programming. The new Chapters 15, 16, 17, and 18 introduce OpenACC, Thrust, CUDA FORTRAN, and C++ AMP. Instead of replicating the detailed descriptions of these tools from their user guides, we focus on the conceptual understanding of the programming problems that these tools are designed to solve.
While we made all these improvements, we also preserved the first edition features that seem to contribute to its popularity. First, we kept the book as concise as possible. While it is very tempting to keep adding material, we want to minimize the number of pages readers need to go through to learn all the key concepts. Second, we kept our explanations as intuitive as possible. While it is extremely tempting to formalize some of the concepts, especially when we cover the basic parallel algorithms, we strive to keep all our explanations intuitive and practical.
The target audience of this book is graduate and undergraduate students from all science and engineering disciplines where computational thinking and parallel programming skills are needed to achieve breakthroughs. We assume that readers have at least some basic C programming experience. We especially target computational scientists in fields such as mechanical engineering, civil engineering, electrical engineering, bio-engineering, physics, chemistry, astronomy, and geography, who use computation to further their field of research. As such, these scientists are both experts in their domain as well as programmers. The book takes the approach of building on basic C programming skills to teach parallel programming in C. We use CUDA C, a parallel programming environment that is supported on NVIDIA GPUs and emulated on CPUs. There are more than 375 million of these processors in the hands of consumers and professionals, and more than 120,000 programmers actively using CUDA. The applications that you develop as part of the learning experience can be run by a very large user community.
We would like to offer some of our experience in teaching courses with this book. Since 2006, we have taught multiple types of courses: in one-semester format and in one-week intensive format. The original ECE498AL course has become a permanent course known as ECE408 or CS483 of the University of Illinois at Urbana-Champaign. We started to write up some early chapters of this book when we offered ECE498AL the second time. The first four chapters were also tested in an MIT class taught by Nicolas Pinto in the spring of 2009. Since then, we have used the book for numerous offerings of ECE408 as well as the VSCSE and PUMPS summer schools.
In ECE498AL the lectures and programming assignments are balanced with each other and organized into three phases:
Phase 1: One lecture based on Chapter 3 is dedicated to teaching the basic CUDA memory/threading model, the CUDA extensions to the C language, and the basic programming/debugging tools. After the lecture, students can write a simple vector addition code in a couple of hours. This is followed by a series of four lectures that give students the conceptual understanding of the CUDA memory model, the CUDA thread execution model, GPU hardware performance features, and modern computer system architecture. These lectures are based on Chapters 4, 5, and 6.
Phase 2: A series of lectures covers floating-point considerations in parallel computing and common data-parallel programming patterns needed to develop a high-performance parallel application. These lectures are based on Chapters 7–10. The performance of the students' matrix multiplication codes increases by about 10 times during this period. The students also complete assignments on convolution, vector reduction, and prefix sum during this period.
Phase 3: Once the students have established solid CUDA programming skills, the remaining lectures cover application case studies, computational thinking, a broader range of parallel execution models, and parallel programming principles. These lectures are based on Chapters 11–20. (The voice and video recordings of these lectures are available online at the ECE408 web site: http://courses.engr.illinois.edu/ece408/.)
While the lectures, labs, and chapters of this book help lay the intellectual foundation for the students, what brings the learning experience together is the final project. The final project is so important to the full-semester course that it is prominently positioned in the course and commands nearly two months’ focus. It incorporates five innovative aspects: mentoring, workshop, clinic, final report, and symposium. (While much of the information about the final project is available at the ECE408 web site, we would like to offer the thinking that was behind the design of these aspects.)
Students are encouraged to base their final projects on problems that represent current challenges in the research community. To seed the process, the instructors should recruit several computational science research groups to propose problems and serve as mentors. The mentors are asked to contribute a one- to two-page project specification sheet that briefly describes the significance of the application, what the mentor would like to accomplish with the student teams on the application, the technical skills (particular type of math, physics, or chemistry courses) required to understand and work on the application, and a list of web and traditional resources that students can draw upon for technical background, general information, and building blocks, along with specific URLs or FTP paths to particular implementations and coding examples. These project specification sheets also provide students with learning experiences in defining their own research projects later in their careers. (Several examples are available at the ECE408 course web site.)
Students are also encouraged to contact their potential mentors during their project selection process. Once the students and the mentors agree on a project, they enter into a close relationship, featuring frequent consultation and project reporting. The instructors should attempt to facilitate the collaborative relationship between students and their mentors, making it a very valuable experience for both mentors and students.
The main vehicle for the whole class to contribute to each other’s final project ideas is the project workshop. We usually dedicate six of the lecture slots to project workshops. The workshops are designed for students’ benefit. For example, if a student has identified a project, the workshop serves as a venue to present preliminary thinking, get feedback, and recruit teammates. If a student has not identified a project, he or she can simply attend the presentations, participate in the discussions, and join one of the project teams. Students are not graded during the workshops, to keep the atmosphere nonthreatening and enable them to focus on a meaningful dialog with the instructors, teaching assistants, and the rest of the class.
The workshop schedule is designed so the instructors and teaching assistants can take some time to provide feedback to the project teams and so that students can ask questions. Presentations are limited to 10 minutes so there is time for feedback and questions during the class period. This limits the class size to about 36 presenters, assuming 90-minute lecture slots. All presentations are preloaded into a PC to control the schedule strictly and maximize feedback time. Since not all students present at the workshop, we have been able to accommodate up to 50 students in each class, with extra workshop time available as needed.
The instructors and teaching assistants must make a commitment to attend all the presentations and to give useful feedback. Students typically need the most help in answering the following questions: (1) Are the projects too big or too small for the amount of time available? (2) Is there existing work in the field that the project can benefit from? (3) Are the computations being targeted for parallel execution appropriate for the CUDA programming model?
Once the students decide on a project and form a team, they are required to submit a design document for the project. This helps them think through the project steps before they jump into it. The ability to do such planning will be important to their later career success. The design document should discuss the background and motivation for the project, application-level objectives and potential impact, main features of the end application, an overview of their design, an implementation plan, their performance goals, a verification plan and acceptance test, and a project schedule.
The teaching assistants hold a project clinic for final project teams during the week before the class symposium. This clinic helps ensure that students are on track and that they have identified the potential roadblocks early in the process. Student teams are asked to come to the clinic with an initial draft of the following three versions of their application: (1) the best CPU sequential code in terms of performance, with SSE2 and other optimizations, which establishes a strong serial base of the code for their speedup comparisons; (2) the best CUDA parallel code in terms of performance—this version is the main output of the project; and (3) a version of the CPU sequential code based on the same algorithm as version (2). This third version is used by the students to characterize the parallel algorithm overhead in terms of extra computations involved.
Student teams are asked to be prepared to discuss the key ideas used in each version of the code, any floating-point numerical issues, any comparison against previous results on the application, and the potential impact on the field if they achieve tremendous speedup. From our experience, the optimal schedule for the clinic is one week before the class symposium. An earlier time typically results in less mature projects and less meaningful sessions. A later time will not give students sufficient time to revise their projects according to the feedback.
Students are required to submit a project report on their team’s key findings. Six lecture slots are combined into a whole-day class symposium. During the symposium, students use presentation slots proportional to the size of the teams. During the presentation, the students highlight the best parts of their project report for the benefit of the whole class. The presentation accounts for a significant part of students’ grades. Each student must answer questions directed to him or her as individuals, so that different grades can be assigned to individuals in the same team. We have recorded these presentations for viewing by future students at the ECE408 web site. The symposium is a major opportunity for students to learn to produce a concise presentation that motivates their peers to read a full paper. After their presentation, the students also submit a full report on their final project.
The lab assignments, final project guidelines, and sample project specifications are available to instructors who use this book for their classes. While this book provides the intellectual contents for these classes, the additional material will be crucial in achieving the overall education goals. We would like to invite you to take advantage of the online material that accompanies this book, which is available at
Finally, we encourage you to submit your feedback. We would like to hear from you if you have any ideas for improving this book. We would like to know how we can improve the supplementary online material. Of course, we would also like to know what you liked about the book. We look forward to hearing from you.
There are so many people who have made special contributions to the second edition. We would like to first thank the contributing authors of the new chapters. Yuan Lin and Vinod Grover wrote the original draft of the OpenACC chapter. Nathan Bell and Jared Hoberock wrote the original draft of the Thrust chapter, with additional contributions on the foundational concepts from Chris Rodrigues. Greg Ruetsch and Massimiliano Fatica wrote the original draft of the CUDA FORTRAN chapter. David Callahan wrote the C++ AMP chapter. Isaac Gelado wrote the original draft of the MPI-CUDA chapter. Brent Oster contributed to the base material and code examples of the Kepler chapter. Without the expertise and contribution of these individuals, we would not have been able to cover these new programming models with the level of insight that we wanted to provide to our readers.
We would like to give special thanks to Izzat El Hajj, who tirelessly helped to verify the code examples and improved the quality of illustrations and exercises.
We would like to especially acknowledge Ian Buck, the father of CUDA, and John Nickolls, the lead architect of the Tesla GPU computing architecture. Their teams laid an excellent infrastructure for this course. John passed away while we were working on the second edition. We miss him dearly. Nadeem Mohammad organized the NVIDIA review efforts and also contributed to Appendix B. Bill Bean, Simon Green, Mark Harris, Nadeem Mohammad, Brent Oster, Peter Shirley, Eric Young, and Cyril Zeller provided review comments and corrections to the manuscripts. Calisa Cole helped with the cover. Nadeem's heroic efforts have been critical to the completion of this book.
We would like to especially thank Jensen Huang for providing a great amount of financial and human resources for developing the course that laid the foundation for this book. Tony Tamasi’s team contributed heavily to the review and revision of the book chapters. Jensen also took the time to read the early drafts of the chapters and gave us valuable feedback. David Luebke has facilitated the GPU computing resources for the course. Jonah Alben has provided valuable insight. Michael Shebanow and Michael Garland have given guest lectures and offered materials.
John Stone and Sam Stone in Illinois contributed much of the base material for the case study and OpenCL chapters. John Stratton and Chris Rodrigues contributed some of the base material for the computational thinking chapter. I-Jui “Ray” Sung, John Stratton, Xiao-Long Wu, and Nady Obeid contributed to the lab material and helped to revise the course material as they volunteered to serve as teaching assistants on top of their research. Jeremy Enos worked tirelessly to ensure that students have a stable, user-friendly GPU computing cluster to work on their lab assignments and projects.
We would like to acknowledge Dick Blahut, who challenged us to create the course in Illinois. His constant reminder that we needed to write the book helped keep us going. Beth Katsinas arranged a meeting between Dick Blahut and NVIDIA Vice President Dan Vivoli. Through that gathering, Blahut was introduced to David and challenged David to come to Illinois and create the course with Wen-mei.
We would also like to thank Thom Dunning of the University of Illinois and Sharon Glotzer of the University of Michigan, Co-Directors of the multi-university Virtual School of Computational Science and Engineering, for graciously hosting the summer school version of the course. Trish Barker, Scott Lathrop, Umesh Thakkar, Tom Scavo, Andrew Schuh, and Beth McKown all helped organize the summer school. Robert Brunner, Klaus Schulten, Pratap Vanka, Brad Sutton, John Stone, Keith Thulborn, Michael Garland, Vlad Kindratenko, Naga Govindaraju, Yan Xu, Arron Shinn, and Justin Haldar contributed to the lectures and panel discussions at the summer school.
Nicolas Pinto tested the early versions of the first chapters in his MIT class and assembled an excellent set of feedback comments and corrections. Steve Lumetta and Sanjay Patel both taught versions of the course and gave us valuable feedback. John Owens graciously allowed us to use some of his slides. Tor Aamodt, Dan Connors, Tom Conte, Michael Giles, Nacho Navarro, and numerous other instructors and their students worldwide have provided us with valuable feedback.
We would like to especially thank our colleagues Kurt Akeley, Al Aho, Arvind, Dick Blahut, Randy Bryant, Bob Colwell, Ed Davidson, Mike Flynn, John Hennessy, Pat Hanrahan, Nick Holonyak, Dick Karp, Kurt Keutzer, Dave Liu, Dave Kuck, Yale Patt, David Patterson, Bob Rao, Burton Smith, Jim Smith, and Mateo Valero, who have taken the time to share their insight with us over the years.
We are humbled by the generosity and enthusiasm of all the great people who contributed to the course and the book.
David B. Kirk and Wen-mei W. Hwu
1.1 Heterogeneous Parallel Computing
1.2 Architecture of a Modern GPU
1.3 Why More Speed or Parallelism?
1.4 Speeding Up Real Applications
1.5 Parallel Programming Languages and Models
1.6 Overarching Goals
1.7 Organization of the Book
Microprocessors based on a single central processing unit (CPU), such as those in the Intel Pentium family and the AMD Opteron family, drove rapid performance increases and cost reductions in computer applications for more than two decades. These microprocessors brought GFLOPS, or giga (10^9) floating-point operations per second, to the desktop and TFLOPS, or tera (10^12) floating-point operations per second, to cluster servers. This relentless drive for performance improvement has allowed application software to provide more functionality, have better user interfaces, and generate more useful results. The users, in turn, demand even more improvements once they become accustomed to these improvements, creating a positive (virtuous) cycle for the computer industry.
This drive, however, has slowed since 2003 due to energy consumption and heat dissipation issues that limited the increase of the clock frequency and the level of productive activities that can be performed in each clock period within a single CPU. Since then, virtually all microprocessor vendors have switched to models where multiple processing units, referred to as processor cores, are used in each chip to increase the processing power. This switch has exerted a tremendous impact on the software developer community [Sutter2005].
Traditionally, the vast majority of software applications are written as sequential programs, as described by von Neumann in his seminal report in 1945 [vonNeumann1945]. The execution of these programs can be understood by a human sequentially stepping through the code. Historically, most software developers have relied on the advances in hardware to increase the speed of their sequential applications under the hood; the same software simply runs faster as each new generation of processors is introduced. Computer users have also become accustomed to the expectation that these programs run faster with each new generation of microprocessors. Such expectation is no longer valid from this day onward. A sequential program will only run on one of the processor cores, which will not become significantly faster than those in use today. Without performance improvement, application developers will no longer be able to introduce new features and capabilities into their software as new microprocessors are introduced, reducing the growth opportunities of the entire computer industry.
Rather, the applications software that will continue to enjoy performance improvement with each new generation of microprocessors will be parallel programs, in which multiple threads of execution cooperate to complete the work faster. This new, dramatically escalated incentive for parallel program development has been referred to as the concurrency revolution [Sutter2005]. The practice of parallel programming is by no means new. The high-performance computing community has been developing parallel programs for decades. These programs run on large-scale, expensive computers. Only a few elite applications can justify the use of these expensive computers, thus limiting the practice of parallel programming to a small number of application developers. Now that all new microprocessors are parallel computers, the number of applications that need to be developed as parallel programs has increased dramatically. There is now a great need for software developers to learn about parallel programming, which is the focus of this book.
Since 2003, the semiconductor industry has settled on two main trajectories for designing microprocessors [Hwu2008]. The multicore trajectory seeks to maintain the execution speed of sequential programs while moving into multiple cores. The multicores began with two-core processors, and the number of cores has increased with each semiconductor process generation. A current exemplar is the recent Intel Core i7™ microprocessor with four processor cores, each of which is an out-of-order, multiple-instruction-issue processor implementing the full X86 instruction set and supporting hyperthreading with two hardware threads, designed to maximize the execution speed of sequential programs. In contrast, the many-thread trajectory focuses more on the execution throughput of parallel applications. The many-threads began with a large number of threads, and once again, the number of threads increases with each generation. A current exemplar is the NVIDIA GTX680 graphics processing unit (GPU) with 16,384 threads, executing in a large number of simple, in-order pipelines.
Many-thread processors, especially GPUs, have led the race of floating-point performance since 2003. As of 2012, the ratio of peak floating-point calculation throughput between many-thread GPUs and multicore CPUs is about 10. These are not necessarily application speeds, but merely the raw speeds that the execution resources can potentially support in these chips: 1.5 teraflops versus 150 gigaflops double precision in 2012.
Such a large performance gap between parallel and sequential execution has amounted to a significant “electrical potential” build-up, and at some point, something will have to give. We have reached that point now. To date, this large performance gap has already motivated many application developers to move the computationally intensive parts of their software to GPUs for execution. Not surprisingly, these computationally intensive parts are also the prime target of parallel programming—when there is more work to do, there is more opportunity to divide the work among cooperating parallel workers.
One might ask why there is such a large peak-performance gap between many-thread GPUs and general-purpose multicore CPUs. The answer lies in the differences in the fundamental design philosophies between the two types of processors, as illustrated in Figure 1.1. The design of a CPU is optimized for sequential code performance. It makes use of sophisticated control logic to allow instructions from a single thread to execute in parallel or even out of their sequential order while maintaining the appearance of sequential execution. More importantly, large cache memories are provided to reduce the instruction and data access latencies of large complex applications. Neither control logic nor cache memories contribute to the peak calculation speed. As of 2012, high-end general-purpose multicore microprocessors typically have six to eight large processor cores and multiple megabytes of on-chip cache memories designed to deliver strong sequential code performance.
Figure 1.1 CPUs and GPUs have fundamentally different design philosophies.
Memory bandwidth is another important issue. The speed of many applications is limited by the rate at which data can be delivered from the memory system into the processors. Graphics chips have been operating at approximately six times the memory bandwidth of contemporaneously available CPU chips. In late 2006, GeForce 8800 GTX, or simply G80, was capable of moving data at about 85 gigabytes per second (GB/s) in and out of its main dynamic random-access memory (DRAM) because of graphics frame buffer requirements and the relaxed memory model (the way various system software, applications, and input/output (I/O) devices expect how their memory accesses work). The more recent GTX680 chip supports about 200 GB/s. In contrast, general-purpose processors have to satisfy requirements from legacy operating systems, applications, and I/O devices that make memory bandwidth more difficult to increase. As a result, CPUs will continue to be at a disadvantage in terms of memory bandwidth for some time.
The design philosophy of GPUs is shaped by the fast-growing video game industry that exerts tremendous economic pressure for the ability to perform a massive number of floating-point calculations per video frame in advanced games. This demand motivates GPU vendors to look for ways to maximize the chip area and power budget dedicated to floating-point calculations. The prevailing solution is to optimize for the execution throughput of massive numbers of threads. The design saves chip area and power by allowing pipelined memory channels and arithmetic operations to have long latency. The reduced area and power of the memory access hardware and arithmetic units allows the designers to have more of them on a chip and thus increase the total execution throughput.
The application software is expected to be written with a large number of parallel threads. The hardware takes advantage of the large number of threads to find work to do when some of them are waiting for long-latency memory accesses or arithmetic operations. Small cache memories are provided to help control the bandwidth requirements of these applications so that multiple threads that access the same memory data do not need to all go to the DRAM. This design style is commonly referred to as throughput-oriented design since it strives to maximize the total execution throughput of a large number of threads while allowing individual threads to take a potentially much longer time to execute.
The CPUs, on the other hand, are designed to minimize the execution latency of a single thread. Large last-level on-chip caches are designed to capture frequently accessed data and convert some of the long-latency memory accesses into short-latency cache accesses. The arithmetic units and operand data delivery logic are also designed to minimize the effective latency of operation at the cost of increased use of chip area and power. By reducing the latency of operations within the same thread, the CPU hardware reduces the execution latency of each individual thread. However, the large cache memory, low-latency arithmetic units, and sophisticated operand delivery logic consume chip area and power that could be otherwise used to provide more arithmetic execution units and memory access channels. This design style is commonly referred to as latency-oriented design.
It should be clear now that GPUs are designed as parallel, throughput-oriented computing engines and they will not perform well on some tasks on which CPUs are designed to perform well. For programs that have one or very few threads, CPUs with lower operation latencies can achieve much higher performance than GPUs. When a program has a large number of threads, GPUs with higher execution throughput can achieve much higher performance than CPUs. Therefore, one should expect that many applications use both CPUs and GPUs, executing the sequential parts on the CPU and numerically intensive parts on the GPUs. This is why the CUDA programming model, introduced by NVIDIA in 2007, is designed to support joint CPU–GPU execution of an application. The demand for supporting joint CPU–GPU execution is further reflected in more recent programming models such as OpenCL (see Chapter 14), OpenACC (see Chapter 15), and C++ AMP (see Chapter 18).
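To make this division of labor concrete, the following is a minimal CUDA C sketch of joint CPU–GPU execution (the vecAdd kernel, the problem size, and the launch configuration are illustrative choices of ours; the CUDA extensions used here are introduced properly in Chapter 3). The sequential setup runs on the CPU, while the numerically intensive loop becomes a kernel executed by a large number of GPU threads:

    #include <stdlib.h>
    #include <cuda_runtime.h>

    // Parallel part: each GPU thread computes one element of the result.
    __global__ void vecAdd(const float *A, const float *B, float *C, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) C[i] = A[i] + B[i];
    }

    int main(void) {
        int n = 1 << 20;                  // illustrative problem size
        size_t size = n * sizeof(float);

        // Sequential part: allocation and initialization on the CPU.
        float *h_A = (float *)malloc(size);
        float *h_B = (float *)malloc(size);
        float *h_C = (float *)malloc(size);
        for (int i = 0; i < n; i++) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

        // Copy inputs to GPU global memory, launch many threads, copy back.
        float *d_A, *d_B, *d_C;
        cudaMalloc(&d_A, size); cudaMalloc(&d_B, size); cudaMalloc(&d_C, size);
        cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
        vecAdd<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);
        cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
        free(h_A); free(h_B); free(h_C);
        return 0;
    }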
It is also important to note that performance is not the only decision factor when application developers choose the processors for running their applications. Several other factors can be even more important. First and foremost, the processors of choice must have a very large presence in the marketplace, referred to as the installed base of the processor. The reason is very simple. The cost of software development is best justified by a very large customer population. Applications that run on a processor with a small market presence will not have a large customer base. This has been a major problem with traditional parallel computing systems that have negligible market presence compared to general-purpose microprocessors. Only a few elite applications funded by government and large corporations have been successfully developed on these traditional parallel computing systems. This has changed with many-core GPUs. Due to their popularity in the PC market, GPUs have been sold by the hundreds of millions. Virtually all PCs have GPUs in them. There are more than 400 million CUDA-enabled GPUs in use to date. This is the first time that massively parallel computing is feasible with a mass-market product. Such a large market presence has made these GPUs economically attractive targets for application developers.
Another important decision factor is practical form factors and easy accessibility. Until 2006, parallel software applications usually ran on data center servers or departmental clusters. But such execution environments tend to limit the use of these applications. For example, in an application such as medical imaging, it is fine to publish a paper based on a 64-node cluster machine. But actual clinical applications on magnetic resonance imaging (MRI) machines have been based on some combination of a PC and special hardware accelerators. The simple reason is that manufacturers such as GE and Siemens cannot sell MRIs with racks of compute server boxes into clinical settings, while this is common in academic departmental settings. In fact, the National Institutes of Health (NIH) refused to fund parallel programming projects for some time: they felt that the impact of parallel software would be limited because huge cluster-based machines would not work in the clinical setting. Today, GE ships MRI products with GPUs, and the NIH funds research using GPU computing.
Yet another important consideration in selecting a processor for executing numeric computing applications is the level of support for the Institute of Electrical and Electronics Engineers (IEEE) floating-point standard. The standard makes it possible to have predictable results across processors from different vendors. While the support for the IEEE floating-point standard was not strong in early GPUs, this has also changed for new generations of GPUs since the introduction of the G80. As we will discuss in Chapter 7, GPU support for the IEEE floating-point standard has become comparable with that of the CPUs. As a result, one can expect that more numerical applications will be ported to GPUs and yield comparable result values as the CPUs. Up to 2009, a major remaining issue was that the GPUs' floating-point arithmetic units were primarily single precision. Applications that truly require double-precision floating-point arithmetic units were not suitable for GPU execution. However, this has changed with the recent GPUs, whose double-precision execution speed approaches about half that of single precision, a level that high-end CPU cores achieve. This makes the GPUs suitable for even more numerical applications.
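As a small, self-contained illustration of why the precision of the arithmetic units matters (a C sketch of our own, not an example from the standard itself), consider accumulating the value 0.1 ten million times. The single-precision sum drifts visibly from the exact answer of 1,000,000 because each addition is rounded to about seven decimal digits, while the double-precision sum stays much closer:

    #include <stdio.h>

    int main(void) {
        float  s = 0.0f;   // single precision: ~7 decimal digits
        double d = 0.0;    // double precision: ~16 decimal digits
        for (int i = 0; i < 10000000; i++) {
            s += 0.1f;
            d += 0.1;
        }
        printf("single: %f\n", s);   // drifts far from 1000000
        printf("double: %f\n", d);   // very close to 1000000
        return 0;
    }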
Until 2006, graphics chips were very difficult to use because programmers had to use the equivalent of graphics API (application programming interface) functions to access the processor cores, meaning that OpenGL or Direct3D techniques were needed to program these chips. Stated more simply, a computation must be expressed as a function that paints a pixel in some way to execute on these early GPUs. This technique was called GPGPU (general-purpose programming using a graphics processing unit). Even with a higher-level programming environment, the underlying code still needs to fit into the APIs that are designed to paint pixels. These APIs limit the kinds of applications that one can actually write for early GPGPUs. Consequently, it did not become a widespread programming phenomenon. Nonetheless, this technology was sufficiently exciting to inspire some heroic efforts and excellent research results.
But everything changed in 2007 with the release of CUDA [NVIDIA2007]. NVIDIA started to devote silicon area on their GPU chips to facilitate the ease of parallel programming. This did not represent software changes alone; additional hardware was added to the chips. In the G80 and its successor chips for parallel computing, CUDA programs no longer go through the graphics interface at all. Instead, a new general-purpose parallel programming interface on the silicon chip serves the requests of CUDA programs. The general-purpose programming interface greatly expands the types of applications that one can easily develop for GPUs. Moreover, all the other software layers were redone as well, so that the programmers can use the familiar C/C++ programming tools. Some of our students tried to do their lab assignments using the old OpenGL-based programming interface, and their experience helped them to greatly appreciate the improvements that eliminated the need for using the graphics APIs for computing applications.
Figure 1.2 shows the architecture of a typical CUDA-capable GPU. It is organized into an array of highly threaded streaming multiprocessors (SMs). In Figure 1.2, two SMs form a building block. However, the number of SMs in a building block can vary from one generation of CUDA GPUs to another. Also, in Figure 1.2, each SM has a number of streaming processors (SPs) that share control logic and an instruction cache. Each GPU currently comes with multiple gigabytes of Graphics Double Data Rate (GDDR) DRAM, referred to as global memory in Figure 1.2. These GDDR DRAMs differ from the system DRAMs on the CPU motherboard in that they are essentially the frame buffer memory that is used for graphics. For graphics applications, they hold video images and texture information for 3D rendering. But for computing, they function as very high bandwidth off-chip memory, though with somewhat longer latency than typical system memory. For massively parallel applications, the higher bandwidth makes up for the longer latency.
Figure 1.2 Architecture of a CUDA-capable GPU.
Figure 1.3 Coverage of sequential and parallel application portions.
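The SM count and global memory capacity of the particular GPU in a system can be inspected at runtime through the CUDA runtime API (device querying is treated in Section 4.6). A minimal sketch, assuming device 0 is the GPU of interest:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);       // query device 0
        printf("Device: %s\n", prop.name);
        printf("Streaming multiprocessors: %d\n", prop.multiProcessorCount);
        printf("Global memory: %.1f GB\n", prop.totalGlobalMem / 1e9);
        printf("Memory bus width: %d bits\n", prop.memoryBusWidth);
        return 0;
    }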
The G80 introduced the CUDA architecture and had 86.4 GB/s of memory bandwidth, plus a communication link to the CPU core logic over a PCI-Express Generation 2 (Gen2) interface. Over PCI-E Gen2, a CUDA application can transfer data from the system memory to the global memory at 4 GB/s, and at the same time upload data back to the system memory at 4 GB/s. Altogether, there is a combined total of 8 GB/s. More recent GPUs use PCI-E Gen3, which supports 8 GB/s in each direction. As the size of GPU memory grows, applications increasingly keep their data in the global memory and only occasionally use the PCI-E to communicate with the CPU system memory if there is need for using a library that is only available on the CPUs. The communication bandwidth is also expected to grow as the CPU bus bandwidth of the system memory grows in the future.
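The effective transfer rate between system memory and global memory on a given machine can be estimated by timing a large cudaMemcpy. The sketch below (the 256 MB buffer size and the host-to-device direction are arbitrary choices) uses pinned host memory, allocated with cudaMallocHost, so the copy can approach the full PCI-E rate, and CUDA events for timing:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        size_t bytes = 256u << 20;               // 256 MB test buffer
        float *h, *d;
        cudaMallocHost(&h, bytes);               // pinned host memory
        cudaMalloc(&d, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start); cudaEventCreate(&stop);
        cudaEventRecord(start);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);  // elapsed time in ms
        printf("Host-to-device: %.1f GB/s\n", (bytes / 1e9) / (ms / 1e3));

        cudaEventDestroy(start); cudaEventDestroy(stop);
        cudaFree(d); cudaFreeHost(h);
        return 0;
    }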
With 16,384 threads, the GTX680 exceeds 1.5 teraflops in double precision. A good application typically runs 5,000–12,000 threads simultaneously on this chip. For those who are used to multithreading in CPUs, note that Intel CPUs support two or four threads per core, depending on the machine model. CPUs, however, are increasingly used with SIMD (single instruction, multiple data) instructions for high numerical performance. The level of parallelism supported by both GPU hardware and CPU hardware is increasing quickly. It is therefore very important to strive for high levels of parallelism when developing computing applications.
As we stated in Section 1.1, the main motivation for massively parallel programming is for applications to enjoy continued speed increase in future hardware generations. One might ask why applications will continue to demand increased speed. Many applications that we have today seem to be running quite fast enough. As we will discuss in the case study chapters, when an application is suitable for parallel execution, a good implementation on a GPU can achieve more than 100 times (100×) speedup over sequential execution on a single CPU core. If the application includes what we call data parallelism, it’s often a simple task to achieve a 10× speedup with just a few hours of work. For anything beyond that, we invite you to keep reading!
Despite the myriad of computing applications in today's world, many exciting mass-market applications of the future are what we currently consider "supercomputing applications," or super-applications. For example, the biology research community is moving more and more into the molecular level. Microscopes, arguably the most important instrument in molecular biology, used to rely on optics or electronic instrumentation. But there are limitations to the molecular-level observations that we can make with these instruments. These limitations can be effectively addressed by incorporating a computational model to simulate the underlying molecular activities with boundary conditions set by traditional instrumentation. With simulation we can measure even more details and test more hypotheses than can ever be imagined with traditional instrumentation alone. These simulations will continue to benefit from the increasing computing speed in the foreseeable future in terms of the size of the biological system that can be modeled and the length of reaction time that can be simulated within a tolerable response time. These enhancements will have tremendous implications for science and medicine.
For applications such as video and audio coding and manipulation, consider our satisfaction with digital high-definition (HD) TV versus older NTSC TV. Once we experience the level of detail in an HDTV, it is very hard to go back to older technology. But consider all the processing that is needed for that HDTV. It is a very parallel process, as are 3D imaging and visualization. In the future, new functionalities such as view synthesis and high-resolution display of low-resolution videos will demand more computing power in the TV. At the consumer level, we will begin to have an increasing number of video and image processing applications that improve the focus, lighting, and other key aspects of the pictures and videos.
Among the benefits offered by more computing speed are much better user interfaces. Consider Apple's iPhone™ interfaces: the user enjoys a much more natural touchscreen interface than on other cell phone devices, even though the iPhone still has a limited-size window. Undoubtedly, future versions of these devices will incorporate higher-definition, 3D perspectives, applications that combine virtual and physical space information for enhanced usability, and voice-based and computer vision–based interfaces, requiring even more computing speed.
Similar developments are underway in consumer electronic gaming. In the past, driving a car in a game was in fact simply a prearranged set of scenes. If your car bumped into an obstacle, the course of your vehicle did not change, only the game score changed. Your wheels were not bent or damaged, and it was no more difficult to drive, regardless of whether you bumped your wheels or even lost a wheel. With increased computing speed, the games can be based on dynamic simulation rather than prearranged scenes. We can expect to see more of these realistic effects in the future: accidents will damage your wheels and your online driving experience will be much more realistic. Realistic modeling and simulation of physics effects are known to demand very large amounts of computing power.
All the new applications that we mention here involve simulating a physical, concurrent world in different ways and at different levels, with tremendous amounts of data being processed. In fact, the problem of handling massive amounts of data is so prevalent that the term big data has become a household word. And with this huge quantity of data, much of the computation can be done on different parts of the data in parallel, although they will have to be reconciled at some point. In most cases, effective management of data delivery can have a major impact on the achievable speed of a parallel application. While techniques for doing so are often well known to a few experts who work with such applications on a daily basis, the vast majority of application developers can benefit from more intuitive understanding and practical working knowledge of these techniques.
We aim to present the data management techniques in an intuitive way to application developers whose formal education may not be in computer science or computer engineering. We also aim to provide many practical code examples and hands-on exercises that help readers acquire working knowledge, which requires a practical programming model that facilitates parallel implementation and supports proper management of data delivery. CUDA offers such a programming model and has been well tested by a large developer community.
How much speedup can be expected from parallelizing an application? It depends on the portion of the application that can be parallelized. If the percentage of time spent in the part that can be parallelized is 30%, a 100× speedup of the parallel portion will reduce the execution time by no more than 29.7%. The speedup for the entire application will be only about 1.4×. In fact, even an infinite amount of speedup in the parallel portion can only slash 30% off the execution time, achieving no more than a 1.43× speedup. On the other hand, if 99% of the execution time is in the parallel portion, a 100× speedup will reduce the application execution to 1.99% of the original time. This gives the entire application a 50× speedup. Therefore, it is very important that an application has the vast majority of its execution in the parallel portion for a massively parallel processor to effectively speed up its execution.
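The arithmetic in the preceding paragraph is an instance of Amdahl's law. If a fraction p of the original execution time can be parallelized and the parallel portion is sped up by a factor S, the overall speedup is

\[ \text{speedup} = \frac{1}{(1 - p) + p/S} \]

With p = 0.30 and S = 100, the formula gives 1/(0.70 + 0.003) ≈ 1.42×; with p = 0.99, it gives 1/(0.01 + 0.0099) ≈ 50×, matching the numbers above.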
Researchers have achieved speedups of more than 100× for some applications. However, this is typically achieved only after extensive optimization and tuning, and after the algorithms have been enhanced so that more than 99.9% of the application execution time is spent in parallel execution. In practice, straightforward parallelization of applications often saturates the memory (DRAM) bandwidth, resulting in only about a 10× speedup. The trick is to figure out how to get around memory bandwidth limitations, which involves doing one of many transformations to utilize specialized GPU on-chip memories to drastically reduce the number of accesses to the DRAM. One must, however, further optimize the code to get around limitations such as limited on-chip memory capacity. An important goal of this book is to help readers fully understand these optimizations and become skilled in them.
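To make the idea concrete, here is a minimal sketch of the kind of transformation involved: a simple three-point averaging kernel in which each block first stages its input tile, plus two halo elements, in on-chip shared memory, so that each input element is read from DRAM roughly once instead of three times. The kernel name and tile size are illustrative, and the kernel assumes it is launched with TILE threads per block; the tiled kernels in Chapters 5 and 8 develop this technique properly.

#define TILE 256

__global__ void avg3_tiled(const float *in, float *out, int n) {
    __shared__ float tile[TILE + 2];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;   /* global element index */
    int lid = threadIdx.x + 1;                         /* local index, shifted for the left halo */

    /* Each thread loads one element into shared memory; the first and last
       threads of the block also load the halo elements. */
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();   /* wait until the whole tile is in shared memory */

    /* The three reads per output element now come from on-chip memory. */
    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}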
Keep in mind that the level of speedup achieved over single-core CPU execution can also reflect the suitability of the CPU to the application: in some applications, CPUs perform very well, making it harder to speed up performance using a GPU. Most applications have portions that can be much better executed by the CPU. Thus, one must give the CPU a fair chance to perform and make sure that code is written so that GPUs complement CPU execution, thus properly exploiting the heterogeneous parallel computing capabilities of the combined CPU–GPU system. This is precisely what the CUDA programming model promotes, as we will further explain in the book.
Figure 1.3 illustrates the main parts of a typical application. Much of a real application’s code tends to be sequential. These sequential parts are illustrated as the “pit” area of the peach: trying to apply parallel computing techniques to these portions is like biting into the peach pit—not a good feeling! These portions are very hard to parallelize. CPUs tend to do a very good job on these portions. The good news is that these portions, although they can take up a large portion of the code, tend to account for only a small portion of the execution time of super-applications.
Then come what we call the “peach meat” portions. These portions are easy to parallelize, as are some early graphics applications. Parallel programming in heterogeneous computing systems can drastically improve the quality of these applications. As illustrated in Figure 1.3, early GPGPUs cover only a small portion of the meat section, which is analogous to a small portion of the most exciting applications. As we will see, the CUDA programming models are designed to cover a much larger section of the peach meat portions of exciting applications. In fact, as we will discuss in Chapter 20, these programming models and their underlying hardware are still evolving at a fast pace to enable efficient parallelization of even larger sections of applications.
Many parallel programming languages and models have been proposed in the past several decades [Mattson2004]. The ones that are the most widely used are Message Passing Interface (MPI) [MPI2009] for scalable cluster computing, and OpenMP [Open2005] for shared-memory multiprocessor systems. Both have become standardized programming interfaces supported by major computer vendors. An OpenMP implementation consists of a compiler and a runtime. A programmer specifies directives (commands) and pragmas (hints) about a loop to the OpenMP compiler. With these directives and pragmas, OpenMP compilers generate parallel code. The runtime system supports the execution of the parallel code by managing parallel threads and resources. OpenMP was originally designed for CPU execution. More recently, a variation called OpenACC (see Chapter 15) has been proposed and supported by multiple computer vendors for programming heterogeneous computing systems.
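As a point of reference, a minimal OpenMP loop in C might look like the following sketch. The pragma is the programmer's directive; the compiler and the runtime create and manage the threads that share the iterations.

#include <omp.h>

void vec_add(const float *a, const float *b, float *c, int n) {
    /* The directive tells the OpenMP compiler that the iterations are
       independent and may be distributed across a team of threads. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}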
The major advantage of OpenACC is that it provides compiler automation and runtime support for abstracting away many parallel programming details from programmers. Such automation and abstraction can help make the application code more portable across systems produced by different vendors, as well as across different generations of systems from the same vendor. This is why we teach OpenACC programming in Chapter 15. However, effective programming in OpenACC still requires the programmers to understand all the detailed parallel programming concepts involved. Because CUDA gives programmers explicit control of these parallel programming details, it is an excellent learning vehicle even for someone who would like to use OpenMP and OpenACC as their primary programming interface. Furthermore, from our experience, OpenACC compilers are still evolving and improving. Many programmers will likely need to use CUDA-style interfaces for parts where OpenACC compilers fall short.
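For comparison, here is a sketch of the same loop written with OpenACC directives, which Chapter 15 covers in detail. The data clauses let the compiler generate the host-to-device and device-to-host transfers; the function name is illustrative.

void vec_add_acc(const float *restrict a, const float *restrict b,
                 float *restrict c, int n) {
    /* The directive asks the compiler to offload the loop to an accelerator
       and to manage the data movement named in the clauses. */
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}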
MPI is a model where computing nodes in a cluster do not share memory [MPI2009]. All data sharing and interaction must be done through explicit message passing. MPI has been successful in high-performance computing (HPC). Applications written in MPI have run successfully on cluster computing systems with more than 100,000 nodes. Today, many HPC clusters employ heterogeneous CPU–GPU nodes. While CUDA is an effective interface with each node, most application developers need to use MPI to program at the cluster level. Therefore, it is important that a parallel programmer in HPC understands how to do joint MPI/CUDA programming, which is presented in Chapter 19.
The amount of effort needed to port an application into MPI, however, can be quite high due to the lack of shared memory across computing nodes. The programmer needs to perform domain decomposition to partition the input and output data across the cluster nodes. Based on the domain decomposition, the programmer also needs to call message sending and receiving functions to manage the data exchange between nodes. CUDA, on the other hand, provides shared memory for parallel execution in the GPU to address this difficulty. As for CPU and GPU communication, CUDA previously provided very limited shared memory capability between the CPU and the GPU. Programmers needed to manage the data transfer between the CPU and the GPU in a manner similar to "one-sided" message passing. New runtime support for a global address space and automated data transfer in heterogeneous computing systems, such as GMAC [GCN2010] and CUDA 4.0, is now available. With GMAC, a CUDA or OpenCL programmer can declare C variables and data structures as shared between the CPU and the GPU. The GMAC runtime maintains coherence and automatically performs optimized data transfer operations on behalf of the programmer on an as-needed basis. Such support significantly reduces the CUDA and OpenCL programming complexity involved in overlapping data transfer with computation and I/O activities.
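As a small illustration of this style of programming, the sketch below exchanges one boundary element between neighboring ranks using the standard blocking point-to-point calls. The array layout (one left halo cell at index 0, owned elements at indices 1 through n) is an assumption made for the example; Chapter 19 develops realistic halo exchanges.

#include <mpi.h>

/* Each rank sends its rightmost owned element to the next rank and
   receives its left halo element from the previous rank. */
void exchange_left_halo(float *local, int n, int rank, int nprocs) {
    MPI_Status status;
    if (rank + 1 < nprocs)
        MPI_Send(&local[n], 1, MPI_FLOAT, rank + 1, 0, MPI_COMM_WORLD);
    if (rank > 0)
        MPI_Recv(&local[0], 1, MPI_FLOAT, rank - 1, 0, MPI_COMM_WORLD, &status);
}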
In 2009, several major industry players, including Apple, Intel, AMD/ATI, and NVIDIA, jointly developed a standardized programming model called Open Computing Language (OpenCL) [Khronos2009]. Similar to CUDA, the OpenCL programming model defines language extensions and runtime APIs to allow programmers to manage parallelism and data delivery in massively parallel processors. In comparison to CUDA, OpenCL relies more on APIs and less on language extensions. This allows vendors to quickly adapt their existing compilers and tools to handle OpenCL programs. OpenCL is a standardized programming model in that applications developed in OpenCL can run correctly without modification on all processors that support the OpenCL language extensions and API. However, one will likely need to modify the applications to achieve high performance for a new processor.
Those who are familiar with both OpenCL and CUDA know that there is a remarkable similarity between the key concepts and features of OpenCL and those of CUDA. That is, a CUDA programmer can learn OpenCL programming with minimal effort. More importantly, virtually all techniques learned using CUDA can be easily applied to OpenCL programming. Therefore, we introduce OpenCL in Chapter 14 and explain how one can apply the key concepts in this book to OpenCL programming.
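As a small illustration of that similarity, here is the same vector addition kernel sketched in CUDA C and in OpenCL C (the kernel names and signatures are illustrative). The main difference visible at this level is how a thread obtains its index: CUDA uses built-in variables, whereas OpenCL uses an intrinsic function call.

/* CUDA C version: built-in variables identify the executing thread. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

/* OpenCL C version: get_global_id() returns the equivalent global index. */
__kernel void vecAddCL(__global const float *a, __global const float *b,
                       __global float *c, int n) {
    int i = get_global_id(0);
    if (i < n) c[i] = a[i] + b[i];
}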
Our primary goal is to teach the readers how to program massively parallel processors to achieve high performance, and our approach will not require a great deal of hardware expertise. Someone once said that if you don’t care about performance, parallel programming is very easy. You can literally write a parallel program in an hour. But we’re going to dedicate many pages to techniques for developing high-performance parallel programs. And, we believe that it will become easy once you develop the right insight and go about it the right way. In particular, we will focus on computational thinking techniques that will enable you to think about problems in ways that are amenable to high-performance parallel computing.
Note that hardware architecture features impose constraints. High-performance parallel programming on most processors will require some knowledge of how the hardware works. It will probably take 10 or more years before we can build tools and machines so that most programmers can work without this knowledge. Even if we have such tools, we suspect that programmers with more knowledge of the hardware will be able to use the tools in a much more effective way than those who do not. However, we will not be teaching computer architecture as a separate topic. Instead, we will teach the essential computer architecture knowledge as part of our discussions on high-performance parallel programming techniques.
Our second goal is to teach parallel programming for correct functionality and reliability, which constitute a subtle issue in parallel computing. Those who have worked on parallel systems in the past know that achieving initial performance is not enough. The challenge is to achieve it in such a way that you can debug the code and support users. The CUDA programming model encourages the use of a simple form of barrier synchronization and memory consistency for managing parallelism. We will show that by focusing on data parallelism, one can achieve both high performance and high reliability in their applications.
Our third goal is scalability across future hardware generations: we explore approaches to parallel programming such that future machines, which will be more and more parallel, can run your code faster than today's machines. We want to help you master parallel programming so that your programs can scale up to the level of performance of new generations of machines. The key to such scalability is to regularize and localize memory data accesses to minimize consumption of critical resources and conflicts in accessing and updating data structures.
Much technical knowledge will be required to achieve these goals, so we will cover quite a few principles and patterns of parallel programming in this book. We cannot guarantee that we will cover all of them, however, so we have selected the most useful and well-proven techniques to cover in detail. To complement your knowledge and expertise, we include a list of recommended literature. We are now ready to give you a quick overview of the rest of the book.
Chapter 2 reviews the history of GPU computing. It starts with a brief summary of the evolution of graphics hardware toward more programmability and then discusses the historical GPGPU movement. Many of the current features and limitations of the CUDA programming model find their root in these historic developments. A good understanding of these historic developments will help readers better understand the current state and the future trends of hardware evolution that will continue to impact the types of applications that will benefit from CUDA.
Chapter 3 introduces data parallelism and CUDA C programming. This chapter relies on the fact that students have had previous experience with C programming. It first introduces CUDA C as a simple, small extension to C that supports heterogeneous CPU–GPU joint computing and the widely used SPMD (single program, multiple data) parallel programming model. It then covers the thought process involved in (1) identifying the part of application programs to be parallelized; (2) isolating the data to be used by the parallelized code, using an API (Application Programming Interface) function to allocate memory on the parallel computing device; (3) using an API function to transfer data to the parallel computing device; (4) developing a kernel function that will be executed by threads in the parallelized part; (5) launching a kernel function for execution by parallel threads; and (6) eventually transferring the data back to the host processor with an API function call.
While the objective of Chapter 3 is to teach enough concepts of the CUDA C programming model so that the students can write a simple parallel CUDA C program, it actually covers several basic skills needed to develop a parallel application based on any parallel programming model. We use a running example of vector addition to make this chapter concrete. We also compare CUDA with other parallel programming models including OpenMP and OpenCL.
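A minimal sketch of that vector addition example, following steps (2) through (6) listed above (error checking omitted, and the helper name vecAddHost is illustrative; Chapter 3 develops the complete version):

#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;      /* step 4: kernel run by each thread */
    if (i < n) c[i] = a[i] + b[i];
}

void vecAddHost(const float *h_a, const float *h_b, float *h_c, int n) {
    size_t size = n * sizeof(float);
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, size);                    /* step 2: allocate device global memory */
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice); /* step 3: copy inputs to the device */
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);      /* step 5: launch the kernel */
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost); /* step 6: copy the result back */
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}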
Chapter 4 presents more details of the parallel execution model of CUDA. It gives enough insight into the creation, organization, resource binding, data binding, and scheduling of threads to enable readers to implement sophisticated computation using CUDA C and reason about the performance behavior of their CUDA code. Chapter 5 is dedicated to the special memories that can be used to hold CUDA variables for managing data delivery and improving program execution speed.
Chapter 6 presents several important performance considerations in current CUDA hardware. In particular, it gives more details of thread execution, memory data accesses, and resource allocation. These details form the conceptual basis for programmers to reason about the consequences of their decisions on organizing their computation and data.
Chapter 7 introduces the concepts of floating-point number format, precision, and accuracy. It shows why different parallel execution arrangements can result in different output values. It also teaches the concept of numerical stability and practical techniques for maintaining numerical stability in parallel algorithms.
Chapters 8–10 present three important parallel computation patterns that give readers more insight into parallel programming techniques and parallel execution mechanisms. Chapter 8 presents convolution, a frequently used parallel computing pattern that requires careful management of data access locality. We also use this pattern to introduce constant memory and caching in modern GPUs. Chapter 9 presents prefix sum, or scan, an important parallel computing pattern that converts sequential computation into parallel computation. We also use this pattern to introduce the concept of work efficiency in parallel algorithms. Chapter 10 presents sparse matrix computation, a pattern used for processing very large data sets. This chapter introduces readers to the concepts of rearranging data for more efficient parallel access: padding, sorting, transposition, and regularization.
While these chapters are based on CUDA, they help readers build up the foundation for parallel programming in general. We believe that humans understand best when we learn from the bottom up. That is, we must first learn the concepts in the context of a particular programming model, which provides us with solid footing when we generalize our knowledge to other programming models. As we do so, we can draw on our concrete experience from the CUDA model. An in-depth experience with the CUDA model also enables us to gain maturity, which will help us learn concepts that may not even be pertinent to the CUDA model.
Chapters 11 and 12 are case studies of two real applications, which take readers through the thought process of parallelizing and optimizing their applications for significant speedups. For each application, we start by identifying alternative ways of formulating the basic structure of the parallel execution and follow up with reasoning about the advantages and disadvantages of each alternative. We then go through the steps of code transformation needed to achieve high performance. These two chapters help readers put all the materials from the previous chapters together and prepare for their own application development projects.
Chapter 13 generalizes the parallel programming techniques into problem decomposition principles, algorithm strategies, and computational thinking. It does so by covering the concept of organizing the computation tasks of a program so that they can be done in parallel. We start by discussing the translational process of organizing abstract scientific concepts into computational tasks, which is an important first step in producing quality application software, serial or parallel. It then discusses parallel algorithm structures and their effects on application performance, which is grounded in the performance tuning experience with CUDA. The chapter concludes with a treatment of parallel programming styles and models, enabling readers to place their knowledge in a wider context. With this chapter, readers can begin to generalize from the SPMD programming style to other styles of parallel programming, such as loop parallelism in OpenMP and fork-join in p-thread programming. Although we do not go into these alternative parallel programming styles, we expect that readers will be able to learn to program in any of them with the foundation they gain in this book.
Chapter 14 introduces the OpenCL programming model from a CUDA programmer’s perspective. Readers will find OpenCL to be extremely similar to CUDA. The most important difference arises from OpenCL’s use of API functions to implement functionalities such as kernel launching and thread identification. The use of API functions makes OpenCL more tedious to use. Nevertheless, a CUDA programmer has all the knowledge and skills needed to understand and write OpenCL programs. In fact, we believe that the best way to teach OpenCL programming is to teach CUDA first. We demonstrate this with a chapter that relates all major OpenCL features to their corresponding CUDA features. We also illustrate the use of these features by adapting our simple CUDA examples into OpenCL.
Chapter 15 presents the OpenACC programming interface. It shows how to use directives and pragmas to tell the compiler that a loop can be parallelized and, if desired, to instruct the compiler how to parallelize the loop. It also uses concrete examples to illustrate how one can take advantage of the interface and make one's code more portable across vendor systems. With the foundational concepts in this book, readers will find the OpenACC programming directives and pragmas easy to learn and master.
Chapter 16 covers Thrust, a productivity-oriented C++ library for building CUDA applications. This chapter shows how modern object-oriented programming interfaces and techniques can be used to increase productivity in a parallel programming environment. In particular, it shows how generic programming and abstractions can significantly reduce the effort and code complexity of applications.
Chapter 17 presents CUDA FORTRAN, an interface that supports FORTRAN-style programming based on the CUDA model. All concepts and techniques learned using CUDA C can be applied when programming in CUDA FORTRAN. In addition, the CUDA FORTRAN interface has strong support for multidimensional arrays, which makes programming of 3D models much more readable. It also assumes the FORTRAN array data layout convention and works better with existing applications written in FORTRAN.
Chapter 18 is an overview of the C++ AMP programming interface from Microsoft. This programming interface uses a combination of language extensions and API support to express data-parallel computation patterns. It allows programmers to use C++ features to increase their productivity. Like OpenACC, C++ AMP abstracts away some of the parallel programming details that are specific to the hardware, so the code is potentially more portable across vendor systems.
Chapter 19 presents an introduction to joint MPI/CUDA programming. We cover the key MPI concepts that a programmer needs to understand to scale their heterogeneous applications to multiple nodes in a cluster environment. In particular, we will focus on domain partitioning, point-to-point communication, and collective communication in the context of scaling a CUDA kernel to multiple nodes.
Chapter 20 introduces the dynamic parallelism capability available in Kepler GPUs and their successors. Dynamic parallelism can potentially help the implementations of sophisticated algorithms to reduce CPU–GPU interaction overhead, free up the CPU for other tasks, and improve the utilization of GPU execution resources. We describe the basic concepts of dynamic parallelism and why some algorithms can benefit from it. We then illustrate the usage of dynamic parallelism with a small contrived code example as well as a more complex, realistic code example.
Chapter 21 offers some concluding remarks and an outlook for the future of massively parallel programming. We first revisit our goals and summarize how the chapters fit together to help achieve the goals. We then present a brief survey of the major trends in the architecture of massively parallel processors and how these trends will likely impact parallel programming in the future. We conclude with a prediction that these fast advances in massively parallel computing will make it one of the most exciting areas in the coming decade.
1. Gelado, I., Cabezas, J., Navarro, N., Stone, J. E., Patel, S. J., & Hwu, W. W. An Asynchronous Distributed Shared Memory Model for Heterogeneous Parallel Systems, International Conference on Architectural Support for Programming Languages and Operating Systems, March 2010. Technical Report, IMPACT Group, University of Illinois, Urbana-Champaign.
2. Hwu WW, Keutzer K, Mattson T. The Concurrency Challenge. IEEE Design and Test of Computers, July/August 2008:312–320.
3. The Khronos Group, The OpenCL Specification Version 1.0, Available at: <http://www.khronos.org/registry/cl/specs/opencl-1.0.29.pdf>.
4. Mattson TG, Sanders BA, Massingill BL. Patterns for Parallel Programming. Boston: Addison-Wesley Professional; 2004.
5. Message Passing Interface Forum, "MPI—A Message Passing Interface Standard Version 2.2," Available at: <http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report.pdf>, Sept. 4, 2009.
6. NVIDIA Corporation, NVIDIA CUDA Compute Unified Device Architecture Programming Guide 1.0, June 2007, Available at: <http://www.cs.berkeley.edu/~yelick/cs194f07/handouts/NVIDIA_CUDA_Programming_Guide.pdf>.
7. OpenMP Architecture Review Board, “OpenMP Application Program Interface Version 3.1.” July 2011, Available at: <http://www.openmp.org/mp-documents/OpenMP3.1.pdf>.
8. Sutter H, Larus J. Software and the Concurrency Revolution. ACM Queue. Sept. 2005;3(7):54–62.
9. von Neumann J. First Draft of a Report on the EDVAC. In: Goldstine HH, ed. The Computer: From Pascal to von Neumann. Princeton, NJ: Princeton University Press; 1972.
10. Wing J. Computational Thinking. Communications of the ACM. 2006;49(3):33–35.
2.1 Evolution of Graphics Pipelines
2.2 GPGPU: An Intermediate Step
2.3 GPU Computing
To CUDA C and OpenCL programmers, GPUs are massively parallel numeric computing processors programmed in C with extensions. One does not need to understand graphics algorithms or terminology to be able to program these processors. However, understanding the graphics heritage of these processors illuminates their strengths and weaknesses with respect to major computational patterns. In particular, the history helps to clarify the rationale behind major architectural design decisions of modern programmable GPUs: massive multithreading, relatively small cache memories compared to CPUs, and bandwidth-centric memory interface design. Insights into the historical developments will also likely give readers the context needed to project the future evolution of GPUs as computing devices.
Three-dimensional (3D) graphics pipeline hardware evolved from the large expensive systems of the early 1980s to small workstations and then to PC accelerators in the mid- to late 1990s. During this period, the performance-leading graphics subsystems declined in price from $50,000 to $500. Over the same period, the performance increased from 50 million pixels per second to 1 billion pixels per second, and from 100,000 vertices per second to 10 million vertices per second. While these advancements have much to do with the relentlessly shrinking feature sizes of semiconductor devices, they also come from innovations in graphics algorithms and hardware design. These innovations have shaped the native hardware capabilities of modern GPUs.
The remarkable advancement of graphics hardware performance has been driven by the market demand for high-quality real-time graphics in computer applications. In an electronic gaming application, for example, one needs to render ever more complex scenes at ever-increasing resolution at a rate of 60 frames per second. The net result is that over the last 30 years, graphics architecture has evolved from a simple pipeline for drawing wireframe diagrams to a highly parallel design consisting of several deep parallel pipelines capable of rendering complex interactive imagery of 3D scenes. Concurrently, many of the hardware functionalities involved became far more sophisticated and user programmable.
From the early 1980s to the late 1990s, the performance-leading graphics hardware consisted of fixed-function pipelines that were configurable but not programmable. In that same era, major graphics Application Programming Interface (API) libraries became popular. An API is a standardized layer of software, that is, a collection of library functions that allows applications (e.g., games) to use software or hardware services and functionality. For example, an API can allow a game to send commands to a graphics processing unit to draw objects on a display. One such API is DirectX, Microsoft's proprietary API for media functionality. The Direct3D component of DirectX provides interface functions to graphics processors. The other major API is OpenGL, an open-standard API supported by multiple vendors and popular in professional workstation applications. This era of fixed-function graphics pipelines roughly corresponds to the first seven generations of DirectX.
Direct Memory Access
Modern computer systems use a specialized hardware mechanism called direct memory access (DMA) to transfer data between an I/O device and the system DRAM. When a program requests an I/O operation, say reading from a disk drive, the operating system arranges it by setting up a DMA operation defined by the starting address of the data in the I/O device buffer memory, the starting address in the system DRAM, the number of bytes to be copied, and the direction of the copy.
Using a specialized hardware mechanism to copy data between I/O devices and system DRAM has two major advantages. First, the CPU is not burdened with the chore of copying data. So, while the DMA hardware is copying data, the CPU can execute programs that do not depend on the I/O data.
The second advantage of using a specialized hardware mechanism to copy data is that the hardware is designed specifically to perform copies. It is very simple and efficient; there is no overhead of fetching and decoding instructions while performing the copy. As a result, the copy can be done at a higher speed than most processors could achieve.
As we will learn later, DMA is used in data copy operations between a CPU and a GPU. It requires pinned memory in DRAM and has subtle implications for how applications should allocate memory.
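As a preview, here is a minimal sketch of the allocation difference (the function and variable names are illustrative). cudaMallocHost requests page-locked, or pinned, host memory that the DMA hardware can transfer from directly and that asynchronous copies require; ordinary pageable allocations force the runtime to stage each copy through an internal pinned buffer.

#include <cuda_runtime.h>

void allocate_buffers(size_t n_bytes, float **h_buf, float **d_buf) {
    /* Pinned (page-locked) host allocation, eligible for direct DMA transfers. */
    cudaMallocHost((void **)h_buf, n_bytes);

    /* Ordinary device global memory allocation. */
    cudaMalloc((void **)d_buf, n_bytes);
}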
Figure 2.1 shows an example of the fixed-function graphics pipeline in early NVIDIA GeForce GPUs. The host interface receives graphics commands and data from the CPU. The commands are typically given by application programs by calling an API function. The host interface typically contains specialized DMA hardware to efficiently transfer bulk data between the host system memory and the graphics pipeline. The host interface also communicates back the status and result data of executing the commands.
Figure 2.1 A fixed-function NVIDIA GeForce graphics pipeline.
Before we describe the other stages of the pipeline, we should clarify that the term vertex usually means the “corners” of a polygon. The GeForce graphics pipeline is designed to render triangles, so vertex is typically used to refer to the corners of a triangle. The surface of an object is drawn as a collection of triangles. The finer the sizes of the triangles are, the better the quality of the picture typically becomes. The vertex control stage in Figure 2.1 receives parameterized triangle data from the CPU. The vertex control stage converts the triangle data into a form that the hardware understands and places the prepared data into the vertex cache.
The vertex shading, transform, and lighting (VS/T&L) stage in Figure 2.1 transforms vertices and assigns per-vertex values (colors, normals, texture coordinates, tangents, etc.). The shading is done by the pixel shader hardware. The vertex shader can assign a color to each vertex but it is not applied to triangle pixels until later. The triangle setup stage further creates edge equations that are used to interpolate colors and other per-vertex data (e.g., texture coordinates) across the pixels touched by the triangle. The raster stage determines which pixels are contained in each triangle. For each of these pixels, the raster stage interpolates per-vertex values necessary for shading the pixel, which includes color, position, and texture position that will be shaded (painted) on the pixel.
The shader stage in Figure 2.1 determines the final color of each pixel. This can be generated as a combined effect of many techniques: interpolation of vertex colors, texture mapping, per-pixel lighting mathematics, reflections, and more. Many effects that make the rendered images more realistic are incorporated in the shader stage. Figure 2.2 illustrates texture mapping, one of the shader stage functionalities. It shows an example in which a world map texture is mapped onto a sphere object. Note that the sphere object is described as a large collection of triangles. Although the shader stage needs to perform only a small number of coordinate transform calculations to identify the exact coordinates of the texture point that will be painted on a point in one of the triangles that describes the sphere object, the sheer number of pixels covered by the image requires the shader stage to perform a very large number of coordinate transforms for each frame.
Figure 2.2 Texture mapping example: painting a world map texture image.
The ROP (raster operation) stage in Figure 2.1 performs the final raster operations on the pixels. It performs color raster operations that blend the color of overlapping/adjacent objects for transparency and anti-aliasing effects. It also determines the visible objects for a given viewpoint and discards the occluded pixels. A pixel becomes occluded when it is blocked by pixels from other objects according to the given viewpoint.
Figure 2.3 illustrates anti-aliasing, one of the ROP stage operations. There are three adjacent triangles on a black background. In the aliased output, each pixel assumes the color of one of the objects or the background. The limited resolution makes the edges look crooked and the shapes of the objects distorted. The problem is that many pixels are partly in one object and partly in another object or the background. Forcing these pixels to assume the color of one of the objects introduces distortion into the edges of the objects. The anti-aliasing operation gives each pixel a color that is blended, or linearly combined, from the colors of all the objects and the background that partly overlap the pixel. The contribution of each object to the color of the pixel is proportional to the amount of the pixel that the object overlaps.
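Written as a simple formula (a plain coverage-weighted blend, ignoring the more elaborate filters real hardware may apply), the anti-aliased pixel color is

\[ C_{\text{pixel}} = \sum_i \alpha_i \, C_i , \qquad \sum_i \alpha_i = 1 , \]

where C_i is the color of object (or background) i and α_i is the fraction of the pixel's area that object i covers.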
Figure 2.3 Examples of anti-aliasing operations: (a) triangle geometry, (b) aliased, and (c) anti-aliased.
Finally, the frame buffer interface (FBI) stage in Figure 2.1 manages memory reads from and writes to the display frame buffer memory. For high-resolution displays, there is a very high bandwidth requirement in accessing the frame buffer. Such bandwidth is achieved by two strategies. One is that graphics pipelines typically use special memory designs that provide higher bandwidth than the system memories. Second, the FBI simultaneously manages multiple memory channels that connect to multiple memory banks. The combined bandwidth improvement of multiple channels and special memory structures gives the frame buffers much higher bandwidth than their contemporaneous system memories. Such high memory bandwidth has continued to this day and has become a distinguishing feature of modern GPU design.
During the past two decades, each generation of hardware and its corresponding generation of API brought incremental improvements to the various stages of the graphics pipeline. Each generation introduced hardware resources and configurability to the pipeline stages. However, developers were growing more sophisticated and asking for more new features than could be reasonably offered as built-in fixed functions. The obvious next step was to make some of these graphics pipeline stages into programmable processors.
In 2001, the NVIDIA GeForce 3 took the first step toward true general shader programmability. It exposed the application developer to what had been the private internal instruction set of the floating-point vertex engine (VS/T&L stage). This coincided with the release of Microsoft DirectX 8 and OpenGL vertex shader extensions. Later GPUs, at the time of DirectX 9, extended general programmability and floating-point capability to the pixel shader stage, and made texture accessible from the vertex shader stage. The ATI Radeon 9700, introduced in 2002, featured a programmable 24-bit floating-point pixel shader processor programmed with DirectX 9 and OpenGL. The GeForce FX added 32-bit floating-point pixel processors. These programmable pixel shader processors were part of a general trend toward unifying the functionality of the different stages as seen by the application programmer. NVIDIA’s GeForce 6800 and 7800 series were built with separate processor designs dedicated to vertex and pixel processing. The XBox 360 introduced an early unified-processor GPU in 2005, allowing vertex and pixel shaders to execute on the same processor.
In graphics pipelines, certain stages do a great deal of floating-point arithmetic on completely independent data, such as transforming the positions of triangle vertices or generating pixel colors. This data independence as the dominating application characteristic is a key difference between the design assumptions for GPUs and CPUs. A single frame, rendered in 1/60 of a second, might have 1 million triangles and 6 million pixels. The opportunity to use hardware parallelism to exploit this data independence is tremendous.
The specific functions executed at a few graphics pipeline stages vary with rendering algorithms. Such variation has motivated the hardware designers to make those pipeline stages programmable. Two particular programmable stages stand out: the vertex shader and the pixel shader. Vertex shader programs map the positions of triangle vertices onto the screen, altering their position, color, or orientation. Typically a vertex shader thread reads a floating-point (x, y, z, w) vertex position and computes a floating-point (x, y, z) screen position. Geometry shader programs operate on primitives defined by multiple vertices, changing them or generating additional primitives. Vertex shader programs and geometry shader programs execute on the VS/T&L stage of the graphics pipeline.
Pixel shader programs each “shade” one pixel, computing a floating-point red, green, blue, alpha (RGBA) color contribution to the rendered image at its pixel sample (x, y) image position. These programs execute on the shader stage of the graphics pipeline. For all three types of graphics shader programs, program instances can be run in parallel, because each works on independent data, produces independent results, and has no side effects. This property has motivated the design of the programmable pipeline stages into massively parallel processors.
Figure 2.4 shows an example of a programmable pipeline that employs a vertex processor and a fragment (pixel) processor. The programmable vertex processor executes the programs designated to the VS/T&L stage and the programmable fragment processor executes the programs designated to the (pixel) shader stage. Between these programmable graphics pipeline stages are dozens of fixed-function stages that perform well-defined tasks far more efficiently than a programmable processor could, and that would benefit far less from programmability. For example, between the geometry processing stage and the pixel processing stage is a “rasterizer,” a complex state machine that determines exactly which pixels (and portions thereof) lie within each geometric primitive’s boundaries. Together, the mix of programmable and fixed-function stages is engineered to balance extreme performance with user control over the rendering algorithms.
Figure 2.4 An example of a separate vertex processor and fragment processor in a programmable graphics pipeline.
Common rendering algorithms perform a single pass over input primitives and access other memory resources in a highly coherent manner. That is, these algorithms tend to simultaneously access contiguous memory locations, such as all triangles or all pixels in a neighborhood. As a result, these algorithms exhibit excellent efficiency in memory bandwidth utilization and are largely insensitive to memory latency. Combined with a pixel shader workload that is usually compute-limited, these characteristics have guided GPUs along a different evolutionary path than CPUs. In particular, whereas the CPU die area is dominated by cache memories, GPUs are dominated by floating-point data path and fixed-function logic. GPU memory interfaces emphasize bandwidth over latency (since latency can be readily hidden by massively parallel execution); indeed, bandwidth is typically many times higher than a CPU, exceeding 190 GB/s in more recent designs.
Introduced in 2006, NVIDIA’s GeForce 8800 GPU mapped the separate programmable graphics stages to an array of unified processors; the logical graphics pipeline is physically a recirculating path that visits these processors three times, with much fixed-function graphics logic between visits. This is illustrated in Figure 2.5. The unified processor array allows dynamic partitioning of the array to vertex shading, geometry processing, and pixel processing. Since different rendering algorithms present wildly different loads among the three programmable stages, this unification allows the same pool of execution resources to be dynamically allocated to different pipeline stages and achieve better load balance.
Figure 2.5 Unified programmable processor array of the GeForce 8800 GT graphics pipeline.
The GeForce 8800 hardware corresponds to the DirectX 10 API generation. By the DirectX 10 generation, the functionality of vertex and pixel shaders was to be made identical to the programmer, and a new logical stage was introduced, the geometry shader, to process all the vertices of a primitive rather than vertices in isolation. The GeForce 8800 was designed with DirectX 10 in mind. Developers were coming up with more sophisticated shading algorithms and this motivated a sharp increase in the available shader operation rate, particularly floating-point operations. NVIDIA pursued a processor design with higher operating clock frequency than what was allowed by standard-cell methodologies to deliver the desired operation throughput as area-efficiently as possible. High–clock speed design requires substantially more engineering effort, and this favored designing one processor array, rather than two (or three, given the new geometry stage). It became worthwhile to take on the engineering challenges of a unified processor (load balancing and recirculation of a logical pipeline onto threads of the processor array) while seeking the benefits of one processor design. Such design paved the way for using the programmable GPU processor array for general numeric computing.
While the GPU hardware design evolved toward more unified processors, it increasingly resembled high-performance parallel computers. As DirectX 9–capable GPUs became available, some researchers took notice of the raw performance growth path of GPUs and they started to explore the use of GPUs to solve compute-intensive science and engineering problems. However, DirectX 9 GPUs had been designed only to match the features required by the graphics APIs. To access the computational resources, a programmer had to cast his or her problem into graphics operations so that the computation could be launched through OpenGL or DirectX API calls. For example, to run many simultaneous instances of a compute function, it had to be written as a pixel shader. The collection of input data had to be stored in texture images and issued to the GPU by submitting triangles (with clipping to a rectangle shape if that’s what was desired). The output had to be cast as a set of pixels generated from the raster operations.
The fact that the GPU processor array and frame buffer memory interface were designed to process graphics data proved too restrictive for general numeric applications. In particular, the output data of the shader programs are single pixels whose memory locations have been predetermined. Thus, the graphics processor array was designed with very restricted memory reading and writing capability. Figure 2.6 illustrates the limited memory access capability of early programmable shader processor arrays; shader programmers needed to use texture to access arbitrary memory locations for their input data. More importantly, shaders did not have the means to perform writes with calculated memory addresses, referred to as scatter operations, to memory. The only way to write a result to memory was to emit it as a pixel color value, and configure the frame buffer operation stage to write (or blend, if desired) the result to a 2D frame buffer.
Figure 2.6 The restricted input and output capabilities of a shader programming model.
Furthermore, the only way to get a result from one pass of computation to the next was to write all parallel results to a pixel frame buffer, then use that frame buffer as a texture map as input to the pixel fragment shader of the next stage of the computation. There was also no support for general user-defined data types—most data had to be stored in one-, two-, or four-component vector arrays. Mapping general computations to a GPU in this era was quite awkward. Nevertheless, intrepid researchers demonstrated a handful of useful applications with painstaking efforts. This field was called GPGPU, for general-purpose computing on GPUs.
While developing the Tesla GPU architecture, NVIDIA realized its potential usefulness would be much greater if programmers could think of the GPU like a processor. NVIDIA selected a programming approach in which programmers would explicitly declare the data-parallel aspects of their workload.
For the DirectX 10–generation graphics, NVIDIA had already begun work on a high-efficiency floating-point and integer processor that could run a variety of simultaneous workloads to support the logical graphics pipeline. The designers of the Tesla architecture GPUs took another step. The shader processors became fully programmable processors with instruction memory, instruction cache, and instruction sequencing control logic. The cost of these additional hardware resources was reduced by having multiple shader processors share their instruction cache and instruction sequencing control logic. This design style works well with graphics applications because the same shader program needs to be applied to a massive number of vertices or pixels. NVIDIA added memory load and store instructions with random byte addressing capability to support the requirements of compiled C programs. To nongraphics application programmers, the Tesla architecture GPUs introduced a more generic parallel programming model with a hierarchy of parallel threads, barrier synchronization, and atomic operations to dispatch and manage highly parallel computing work. NVIDIA also developed the CUDA C/C++ compiler, libraries, and runtime software to enable programmers to readily access the new data-parallel computation model and develop applications. Programmers no longer need to use the graphics API to access the GPU parallel computing capabilities. The G80 chip was based on the Tesla architecture and was used in NVIDIA’s GeForce 8800 GTX. G80 was followed later by G92, GT200, Fermi, and Kepler.
Scalability has been an attractive feature of graphics systems from the beginning. In the early days, workstation graphics systems gave customers a choice in pixel horsepower by varying the number of pixel processor circuit boards installed. Prior to the mid-1990s, PC graphics scaling was almost nonexistent. There was one option—the VGA controller. As 3D-capable accelerators appeared, there was room in the market for a range of offerings; for instance, 3dfx introduced multiboard scaling with the original SLI (scan line interleave) on their Voodoo2, which held the performance crown for its time (1998). Also in 1998, NVIDIA introduced distinct products as variants on a single architecture with Riva TNT Ultra (high performance) and Vanta (low cost), first by speed binning and packaging, then with separate chip designs (GeForce 2 GTS and GeForce 2 MX). At present, for a given architecture generation, four or five separate chip designs are needed to cover the range of desktop PC performance and price points. In addition, there are separate segments in notebook and workstation systems. After acquiring 3dfx, NVIDIA continued the multi-GPU SLI concept in 2004 starting with GeForce 6800, providing multi-GPU scalability transparently to both the programmer and the user. Functional behavior is identical across the scaling range; one application will run unchanged on any implementation of an architectural family.
By switching to the multicore trajectory, CPUs are scaling to higher transistor counts by increasing the number of constant-performance cores on a die, rather than increasing the performance of a single core. At this writing the industry is transitioning from quad-core to oct-core CPUs. Programmers are forced to find four-fold to eight-fold parallelism to fully utilize these processors. Many of them resort to coarse-grained parallelism strategies where different tasks of an application are performed in parallel. Such applications must be rewritten often to have more parallel tasks for each successive doubling of core count. In contrast, the highly multithreaded GPUs encourage the use of massive, fine-grained data parallelism in CUDA. Efficient threading support in GPUs allows applications to expose a much larger amount of parallelism than available hardware execution resources with little or no penalty. Each doubling of GPU core count provides more hardware execution resources that exploit more of the exposed parallelism for higher performance. That is, the GPU parallel programming model for graphics and parallel computing is designed for transparent and portable scalability. A graphics program or CUDA program is written once, and runs on a GPU with any number of processors.
Academic and industrial work on applications using CUDA has produced hundreds of examples of successful CUDA programs. Many of these examples are presented in GPU Computing Gems, Emerald and Jade editions [Hwu2011a, Hwu2011b] with source code available at www.gpucomputing.net. These programs often run tens of times faster on a CPU–GPU system than on a CPU alone. With the introduction of tools like MCUDA [SSH2008], the parallel threads of a CUDA program can also run efficiently on a multicore CPU, although at a lower speed than GPUs due to a lower level of floating-point execution resources. Examples of these applications include n-body simulation, molecular modeling, computational finance, and oil/gas reservoir simulation. Although many of these use single-precision floating-point arithmetic, some problems require double precision. The high-throughput double-precision floating-point arithmetic in more recent Fermi and Kepler GPUs enabled an even broader range of applications to benefit from GPU acceleration.
Naturally, the number of processor cores will continue to increase in proportion to increases in available transistors as silicon processes improve. In addition, GPUs will continue to go through vigorous architectural evolution. Despite their demonstrated high performance on data-parallel applications, GPU core processors are still of relatively simple design. More aggressive techniques will be introduced with each successive generation to increase the actual utilization of the calculating units. Because scalable parallel computing on GPUs is still a young field, novel applications are rapidly being created. By studying them, GPU designers will continue to discover and implement new machine optimizations.
1. Akeley K, Jermoluk T. High-performance polygon rendering. Proc SIGGRAPH 1988 1988:239–246.
2. Akeley K. RealityEngine graphics. Proc SIGGRAPH 1993 1993:109–116.
3. Blelloch GB. Prefix sums and their applications. In: Reif JH, ed. Synthesis of Parallel Algorithms. San Francisco: Morgan Kaufmann; 1990.
4. Blythe D. The Direct3D 10 system. ACM Trans Graphics. 2006;25(3):724–734.
5. Buck I, Foley T, Horn D, et al. Brook for GPUs: Stream computing on graphics hardware. Proc SIGGRAPH 2004 2004:777–786. Also available at: <http://doi.acm.org/10.1145/1186562.1015800>.
6. Elder G. “Radeon 9700,” Eurographics/SIGGRAPH Workshop on Graphics Hardware, Hot3D Session, 2002; Available at: <http://www.graphicshardware.org/previous/www_2002/presentations/Hot3D-RADEON9700.ppt>.
7. Fernando R, Kilgard MJ. The Cg Tutorial: The Definitive Guide to Programmable Real-Time Graphics. Reading, MA: Addison-Wesley; 2003.
8. Fernando R, ed. GPU Gems: Programming Techniques, Tips, and Tricks for Real-Time Graphics. Reading, MA: Addison-Wesley; 2004. Also available at: <http://developer.nvidia.com/object/gpu_gems_home.html>.
9. Foley J, van Dam A, Feiner S, Hughes J. Computer Graphics: Principles and Practice, 2nd ed. in C. Reading, MA: Addison-Wesley; 1995.
10. Hillis WD, Steele GL. Data parallel algorithms. Commun ACM. 1986;29(12):1170–1183. <http://doi.acm.org/10.1145/7902.7903>.
11. IEEE 754R Working Group. Draft standard for floating-point arithmetic P754. <http://www.validlab.com/754R/drafts/archive/2006-10-04.pdf>.
12. Industrial Light and Magic. OpenEXR. 2003; Available at: <http://www.openexr.com>.
13. Intel Corporation. Intel 64 and IA-32 Architectures Optimization Reference Manual. 2007; Available at: <http://www3.intel.com/design/processor/manuals/248966.pdf>.
14. Kessenich J. The OpenGL Shading Language, Language Version 1.20. 2006; Available at: <http://www.opengl.org/documentation/specs/>.
15. Kirk D, Voorhies D. The rendering architecture of the DN10000VS. Proc SIGGRAPH 1990 1990:299–307.
16. Lindholm E, Kilgard MJ, Moreton H. A user-programmable vertex engine. Proc SIGGRAPH 2001 2001:149–158.
17. Lindholm E, Nickolls J, Oberman S, Montrym J. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro. 2008;28(2):39–55.
18. Microsoft Corporation. Microsoft DirectX Specification. Available at: <http://msdn.microsoft.com/directx/>.
19. Microsoft Corporation. Microsoft DirectX 9 Programmable Graphics Pipeline. Redmond, WA: Microsoft Press; 2003.
20. Montrym J, Baum D, Dignam D, Migdal C. InfiniteReality: A real-time graphics system. Proc SIGGRAPH 1997 1997:293–301.
21. Montrym J, Moreton H. The GeForce 6800. IEEE Micro. 2005;25(2):41–51.
22. Moore GE. Cramming more components onto integrated circuits. Electronics. 1965;38. Available at: <http://download.intel.com/museum/Moores_Law/Articles-Press_Releases/Gordon_Moore_1965_Article.pdf>.
23. Nguyen H, ed. GPU Gems 3. Reading, MA: Addison-Wesley; 2008.
24. Nickolls J, Buck I, Garland M, Skadron K. Scalable parallel programming with CUDA. ACM Queue. 2008;6(2):40–53.
25. NVIDIA. CUDA Zone. 2012; Available at: <http://www.nvidia.com/CUDA>.
26. NVIDIA. CUDA Programming Guide 1.1. 2007; Available at: <http://developer.download.nvidia.com/compute/cuda/1_1/NVIDIA_CUDA_Programming_Guide_1.1.pdf>.
27. NVIDIA. PTX: Parallel Thread Execution ISA Version 1.1. 2007; Available at: <http://www.nvidia.com/object/io_1195170102263.html>.
28. Nyland L, Harris M, Prins J. Fast N-body simulation with CUDA. In: Nguyen H, ed. GPU Gems 3. Reading, MA: Addison-Wesley; 2007.
29. Oberman SF, Siu MY. A high-performance area-efficient multifunction interpolator. Proc. 17th IEEE Symp. Computer Arithmetic (pp. 272–279). Seattle, Washington, 2005.
30. Patterson DA, Hennessy JL. Computer Organization and Design: The Hardware/Software Interface, 3rd ed. San Francisco: Morgan Kaufmann; 2004.
31. Pharr M, ed. GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation. Reading, MA: Addison-Wesley; 2005.
32. Satish N, Harris M, Garland M. Designing efficient sorting algorithms for manycore GPUs. Proc. 23rd IEEE International Parallel and Distributed Processing Symposium. Rome, Italy, 2009.
33. Segal M, Akeley K. The OpenGL Graphics System: A Specification, Version 2.1. 2006; Available at: <http://www.opengl.org/documentation/specs/>.
34. Sengupta S, Harris M, Zhang Y, Owens JD. Scan primitives for GPU computing. Proc. Graphics Hardware 2007 (pp. 97–106). San Diego, California, Aug. 2007.
35. Hwu W, ed. GPU Computing Gems, Emerald Edition. San Francisco: Morgan Kaufmann; 2011.
36. Hwu W, ed. GPU Computing Gems, Jade Edition. San Francisco: Morgan Kaufmann; 2011.
37. Stratton JA, Stone SS, Hwu WW. MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs. The 21st International Workshop on Languages and Compilers for Parallel Computing, 2008; Canada. Also available as Lecture Notes in Computer Science.
38. Volkov V, Demmel J. LU, QR and Cholesky factorizations using vector capabilities of GPUs. Technical Report No. UCB/EECS-2008-49, 1–11. Also available at: <http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-49.html>.
39. Williams S, Oliker L, Vuduc R, Shalf J, Yelick K, Demmel J. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Proc Supercomputing 2007 (SC’07), 2007. doi:10.1145/1362622.1362674. Reno, Nevada.
3.1 Data Parallelism
3.2 CUDA Program Structure
3.3 A Vector Addition Kernel
3.4 Device Global Memory and Data Transfer
3.5 Kernel Functions and Threading
3.6 Summary
3.7 Exercises
Our main objective is to teach the key concepts involved in writing massively parallel programs in a heterogeneous computing system. This requires many code examples expressed in a reasonably simple language that supports massive parallelism and heterogeneous computing. We have chosen CUDA C for our code examples and exercises. CUDA C is an extension to the popular C programming language with new keywords and application programming interfaces for programmers to take advantage of heterogeneous computing systems that contain both CPUs and massively parallel GPUs. For the rest of this book, we will refer to CUDA C simply as CUDA. To a CUDA programmer, the computing system consists of a host that is a traditional CPU, such as an Intel architecture microprocessor in personal computers today, and one or more devices that are processors with a massive number of arithmetic units. A CUDA device is typically a GPU. Many modern software applications have sections that exhibit a rich amount of data parallelism, a phenomenon that allows arithmetic operations to be safely performed on different parts of the data structures in parallel. CUDA devices accelerate the execution of these applications by applying their massive number of arithmetic units to these data-parallel program sections. Since data parallelism plays such an important role in CUDA, we will first discuss the concept of data parallelism before introducing the basic features of CUDA.
Modern software applications often process a large amount of data and incur long execution time on sequential computers. Many of them operate on data that represents or models real-world, physical phenomena. Images and video frames are snapshots of a physical world where different parts of a picture capture simultaneous, independent physical events. Rigid-body physics and fluid dynamics model natural forces and movements that can be independently evaluated within small time steps. Airline scheduling deals with thousands of flights, crews, and airport gates that operate in parallel. Such independent evaluation is the basis of data parallelism in these applications.
Task Parallelism versus Data Parallelism
Data parallelism is not the only type of parallelism widely used in parallel programming. Task parallelism has also been used extensively in parallel programming. Task parallelism is typically exposed through task decomposition of applications. For example, a simple application may need to do a vector addition and a matrix–vector multiplication. Each of these would be a task. Task parallelism exists if the two tasks can be done independently.
In large applications, there are usually a larger number of independent tasks and therefore a larger amount of task parallelism. For example, in a molecular dynamics simulator, the list of natural tasks includes vibrational forces, rotational forces, neighbor identification for nonbonding forces, nonbonding forces, velocity and position, and other physical properties based on velocity and position.
In general, data parallelism is the main source of scalability for parallel programs. With large data sets, one can often find abundant data parallelism to be able to utilize massively parallel processors and allow application performance to grow with each generation of hardware that has more execution resources. Nevertheless, task parallelism can also play an important role in achieving performance goals. We will be covering task parallelism later when we introduce CUDA streams.
Let us illustrate the concept of data parallelism with a vector addition example in Figure 3.1. In this example, each element of the sum vector C is generated by adding an element of input vector A to an element of input vector B. For example, C[0] is generated by adding A[0] to B[0], and C[3] is generated by adding A[3] to B[3]. All additions can be performed in parallel. Therefore, vector addition of two large vectors exhibits a rich amount of data parallelism. Data parallelism in real applications can be more complex and will be discussed in detail later.
Figure 3.1 Data parallelism in vector addition.
The structure of a CUDA program reflects the coexistence of a host (CPU) and one or more devices (GPUs) in the computer. Each CUDA source file can have a mixture of both host and device code. By default, any traditional C program is a CUDA program that contains only host code. One can add device functions and data declarations into any C source file. The function or data declarations for the device are clearly marked with special CUDA keywords. These are typically functions that exhibit a rich amount of data parallelism.
Once device functions and data declarations are added to a source file, it is no longer acceptable to a traditional C compiler. The code needs to be compiled by a compiler that recognizes and understands these additional declarations. We will be using a CUDA C compiler by NVIDIA called NVCC (NVIDIA C Compiler). As shown at the top of Figure 3.2, NVCC processes a CUDA program, using the CUDA keywords to separate the host code and device code. The host code is straight ANSI C code, which is further compiled with the host’s standard C/C++ compilers and is run as a traditional CPU process. The device code is marked with CUDA keywords for labeling data-parallel functions, called kernels, and their associated data structures. The device code is further compiled by a runtime component of NVCC and executed on a GPU device. In situations where there is no device available or a kernel can be appropriately executed on a CPU, one can also choose to execute the kernel on a CPU using tools like MCUDA [Stratton 2008].
Figure 3.2 Overview of the compilation process of a CUDA program.
The execution of a CUDA program is illustrated in Figure 3.3. The execution starts with host (CPU) execution. When a kernel function is called, or launched, it is executed by a large number of threads on a device. All the threads that are generated by a kernel launch are collectively called a grid. Figure 3.3 shows the execution of two grids of threads. We will discuss how these grids are organized soon. When all threads of a kernel complete their execution, the corresponding grid terminates, and the execution continues on the host until another kernel is launched. Note that Figure 3.3 shows a simplified model where the CPU execution and the GPU execution do not overlap. Many heterogeneous computing applications actually manage overlapped CPU and GPU execution to take advantage of both CPUs and GPUs.
Threads
A thread is a simplified view of how a processor executes a program in modern computers. A thread consists of the code of the program, the particular point in the code that is being executed, and the values of its variables and data structures. The execution of a thread is sequential as far as a user is concerned. One can use a source-level debugger to monitor the progress of a thread by executing one statement at a time, looking at the statement that will be executed next, and checking the values of the variables and data structures.
Threads have been used in traditional CPU programming for many years. If a programmer wants to start parallel execution in an application, he or she needs to create and manage multiple threads using thread libraries or special languages.
In CUDA, the execution of each thread is sequential as well. A CUDA program initiates parallel execution by launching kernel functions, which causes the underlying runtime mechanisms to create many threads that process different parts of the data in parallel.
Figure 3.3 Execution of a CUDA program.
Launching a kernel typically generates a large number of threads to exploit data parallelism. In the vector addition example, each thread can be used to compute one element of the output vector C. In this case, the number of threads that will be generated by the kernel is equal to the vector length. For long vectors, a large number of threads will be generated. CUDA programmers can assume that these threads take very few clock cycles to generate and schedule due to efficient hardware support. This is in contrast with traditional CPU threads that typically take thousands of clock cycles to generate and schedule.
We now use vector addition to illustrate the CUDA programming model. Before we show the kernel code for vector addition, it is helpful to first review how a conventional CPU-only vector addition function works. Figure 3.4 shows a simple traditional C program that consists of a main function and a vector addition function. In each piece of host code, we will prefix the names of variables that are mainly processed by the host with h_, and those of variables that are mainly processed by a device with d_, to remind ourselves of the intended usage of these variables.
Figure 3.4 A simple traditional vector addition C code example.
Assume that the vectors to be added are stored in arrays h_A and h_B that are allocated and initialized in the main program. The output vector is in array h_C, which is also initialized in the main program. For brevity, we do not show the details of how h_A, h_B, and h_C are allocated or initialized. A complete source code listing that contains more details is available in Appendix A. The pointers to these arrays are passed to the vecAdd() function, along with the variable N that contains the length of the vectors.
Pointers in the C Language
The function arguments A, B, and C in Figure 3.4 are pointers. In the C language, a pointer can be used to access variables and data structures. While a floating-point variable V can be declared with:
float V;
a pointer variable P can be declared with:
float *P;
By assigning the address of V to P with the statement P = &V, we make P “point to” V. *P becomes a synonym for V. For example, U = *P assigns the value of V to U. As another example, *P = 3 changes the value of V to 3. An array in a C program can be accessed through a pointer that points to its 0th element. For example, the statement P = &(h_A[0]) makes P point to the 0th element of array h_A. P[i] becomes a synonym for h_A[i]. In fact, the array name h_A is in itself a pointer to its 0th element. In Figure 3.4, passing the array name h_A as the first argument to the function call to vecAdd makes the function’s first parameter A point to the 0th element of h_A. We say that h_A is passed by reference to vecAdd. As a result, A[i] in the function body can be used to access h_A[i]. See Patt & Patel [Patt] for an easy-to-follow explanation of the detailed usage of pointers in C.
The vecAdd() function in Figure 3.4 uses a for loop to iterate through the vector elements. In the ith iteration, output element C[i] receives the sum of A[i] and B[i]. The vector length parameter n is used to control the loop so that the number of iterations matches the length of the vectors. The parameters A, B, and C are passed by reference, so the function reads the elements of h_A and h_B and writes the elements of h_C through the parameter pointers A, B, and C. When the vecAdd() function returns, the subsequent statements in the main function can access the new contents of h_C.
A straightforward way to execute vector addition in parallel is to modify the vecAdd() function and move its calculations to a CUDA device. The structure of such a modified vecAdd() function is shown in Figure 3.5. At the beginning of the file, we need to add a C preprocessor directive to include the cuda.h header file. This file defines the CUDA API functions and built-in variables that we will be introducing soon. Part 1 of the function allocates space in the device (GPU) memory to hold copies of the A, B, and C vectors, and copies the vectors from the host memory to the device memory. Part 2 launches parallel execution of the actual vector addition kernel on the device. Part 3 copies the sum vector C from the device memory back to the host memory.
Figure 3.5 Outline of a revised vecAdd() function that moves the work to a device.
Note that the revised vecAdd() function is essentially an outsourcing agent that ships input data to a device, activates the calculation on the device, and collects the results from the device. The agent does so in such a way that the main program does not even need to be aware that the vector addition is now actually done on a device. The details of the revised function, as well as the way to compose the kernel function, will be shown as we introduce the basic features of the CUDA programming model.
In CUDA, the host and devices have separate memory spaces. This reflects the current reality that devices are often hardware cards that come with their own DRAM. For example, the NVIDIA GTX480 comes with up to 4 GB (billion bytes, or gigabytes) of DRAM, called global memory. We will also refer to global memory as device memory. To execute a kernel on a device, the programmer needs to allocate global memory on the device and transfer pertinent data from the host memory to the allocated device memory. This corresponds to Part 1 of Figure 3.5. Similarly, after device execution, the programmer needs to transfer result data from the device memory back to the host memory and free up the device memory that is no longer needed. This corresponds to Part 3 of Figure 3.5. The CUDA runtime system provides Application Programming Interface (API) functions to perform these activities on behalf of the programmer. From this point on, we will simply say that a piece of data is transferred from host to device as shorthand for saying that the data is copied from the host memory to the device memory. The same holds for the opposite direction.
Figure 3.6 shows a CUDA host memory and device memory model for programmers to reason about the allocation of device memory and the movement of data between host and device. The device global memory can be accessed by the host to transfer data to and from the device, as illustrated by the bidirectional arrows between these memories and the host in Figure 3.6. There are more device memory types than shown in Figure 3.6. Constant memory can be accessed in a read-only manner by device functions, which will be described in Chapter 8. We will also discuss the use of registers and shared memory in Chapter 5. See the CUDA Programming Guide for the functionality of texture memory. For now, we will focus on the use of global memory.
Figure 3.6 Host memory and device global memory.
The CUDA runtime system provides API functions for managing data in the device memory. For example, Parts 1 and 3 of the vecAdd() function in Figure 3.5 need to use these API functions to allocate device memory for A, B, and C; transfer A and B from host memory to device memory; transfer C from device memory to host memory; and free the device memory for A, B, and C. We will explain the memory allocation and free functions first. Figure 3.7 shows two API functions for allocating and freeing device global memory. Function cudaMalloc() can be called from the host code to allocate a piece of device global memory for an object. Readers should notice the striking similarity between cudaMalloc() and the standard C runtime library malloc(). This is intentional; CUDA is C with minimal extensions. CUDA uses the standard C runtime library malloc() function to manage the host memory and adds cudaMalloc() as an extension to the C runtime library. By keeping the interface as close to the original C runtime libraries as possible, CUDA minimizes the time that a C programmer spends to relearn the use of these extensions.
Figure 3.7 CUDA API functions for managing device global memory.
The first parameter to the cudaMalloc() function is the address of a pointer variable that will be set to point to the allocated object. The address of the pointer variable should be cast to (void **) because the function expects a generic pointer; the memory allocation function is a generic function that is not restricted to any particular type of objects. This parameter allows the cudaMalloc() function to write the address of the allocated memory into the pointer variable. The host code passes this pointer value to the kernels that need to access the allocated memory object. The second parameter to the cudaMalloc() function gives the size of the data to be allocated, in terms of bytes. The usage of this second parameter is consistent with the size parameter to the C malloc() function.
We now use a simple code example to illustrate the use of cudaMalloc(). This is a continuation of the example in Figure 3.5. For clarity, we will start a pointer variable with d_ to indicate that it points to an object in the device memory. The program passes the address of d_A (i.e., &d_A) as the first parameter after casting it to a void pointer. That is, d_A will point to the device memory region allocated for the A vector. The size of the allocated region will be n times the size of a single-precision floating-point number, which is 4 bytes in most computers today. After the computation, cudaFree() is called with pointer d_A as input to free the storage space for the A vector from the device global memory.
float *d_A;
int size = n * sizeof(float);
cudaMalloc((void**)&d_A, size);
…
cudaFree(d_A);
The addresses in d_A, d_B, and d_C are addresses in the device memory. These addresses should not be dereferenced in the host code. They should mostly be used in calling API functions and kernel functions. Dereferencing a device memory pointer in host code can cause exceptions or other types of runtime errors.
Readers should complete Part 1 of the vecAdd() example in Figure 3.5 with similar declarations of the d_B and d_C pointer variables as well as their corresponding cudaMalloc() calls. Furthermore, Part 3 in Figure 3.5 can be completed with the cudaFree() calls for d_B and d_C.
Once the host code has allocated device memory for the data objects, it can request that data be transferred from host to device. This is accomplished by calling one of the CUDA API functions. Figure 3.8 shows such an API function, cudaMemcpy(). The cudaMemcpy() function takes four parameters. The first parameter is a pointer to the destination location for the data object to be copied. The second parameter points to the source location. The third parameter specifies the number of bytes to be copied. The fourth parameter indicates the types of memory involved in the copy: from host memory to host memory, from host memory to device memory, from device memory to host memory, and from device memory to device memory. For example, the cudaMemcpy() function can be used to copy data from one location of the device memory to another location of the device memory.
Figure 3.8 CUDA API function for data transfer between host and device.
Error Handling in CUDA
In general, it is very important for a program to check and handle errors. CUDA API functions return flags that indicate whether an error has occurred when they served the request. Most errors are due to inappropriate argument values used in the call.
For brevity, we will not show error checking code in our examples. For example, line 1 in Figure 3.9 shows a call to cudaMalloc():
Figure 3.9 A more complete version of vecAdd().
cudaMalloc((void **) &d_A, size);
In practice, we should surround the call with code that tests for error conditions and prints out error messages so that the user can be aware of the fact that an error has occurred. A simple version of such checking code is as follows:
cudaError_t err = cudaMalloc((void **) &d_A, size);
if (err != cudaSuccess) {
printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__);
exit(EXIT_FAILURE);
}
This way, if the system is out of device memory, the user will be informed about the situation.
One would usually define a C macro to make the checking code more concise in the source.
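One possible form of such a macro is sketched below; the macro name CHECK and the exact wording of the message are our own choices, not part of the CUDA API:

#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err = (call);                                \
        if (err != cudaSuccess) {                                \
            printf("%s in %s at line %d\n",                      \
                   cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

With this macro, the allocation above becomes a single line: CHECK(cudaMalloc((void**)&d_A, size));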
The vecAdd() function calls the cudaMemcpy() function to copy the A and B vectors from host to device before adding them and to copy the C vector from the device to the host after the addition is done. Assume that the values of A, B, d_A, d_B, and size have already been set as we discussed before; the three cudaMemcpy() calls are shown below. The two symbolic constants, cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost, are recognized, predefined constants of the CUDA programming environment. Note that the same function can be used to transfer data in both directions by properly ordering the source and destination pointers and using the appropriate constant for the transfer type.
cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
To summarize, the main program in Figure 3.4 calls vecAdd(), which is also executed on the host. The vecAdd() function, outlined in Figure 3.5, allocates device memory, requests data transfers, and launches the kernel that performs the actual vector addition. We often refer to this type of host code as a stub function for launching a kernel. After the kernel finishes execution, vecAdd() also copies result data from device to the host. We show a more complete version of the vecAdd() function in Figure 3.9.
Compared to Figure 3.5, the vecAdd() function in Figure 3.9 is complete for Parts 1 and 3. Part 1 allocates device memory for d_A, d_B, and d_C and transfers A to d_A and B to d_B. This is done by calling the cudaMalloc() and cudaMemcpy() functions. Readers are encouraged to write their own function calls with the appropriate parameter values and compare their code with that shown in Figure 3.9. Part 2 invokes the kernel and will be described in the following section. Part 3 copies the sum data from device memory to host memory so that the value will be available to main(). This is accomplished with a call to the cudaMemcpy() function. It then frees the memory for d_A, d_B, and d_C from the device memory, which is done by calls to the cudaFree() function.
We are now ready to discuss more about the CUDA kernel functions and the effect of launching these kernel functions. In CUDA, a kernel function specifies the code to be executed by all threads during a parallel phase. Since all these threads execute the same code, CUDA programming is an instance of the well-known SPMD (single program, multiple data) [Atallah1998] parallel programming style, a popular programming style for massively parallel computing systems.
When a host code launches a kernel, the CUDA runtime system generates a grid of threads that are organized in a two-level hierarchy. Each grid is organized into an array of thread blocks, which will be referred to as blocks for brevity. All blocks of a grid are of the same size; each block can contain up to 1,024 threads.7 Figure 3.10 shows an example where each block consists of 256 threads. The number of threads in each thread block is specified by the host code when a kernel is launched. The same kernel can be launched with different numbers of threads at different parts of the host code. For a given grid of threads, the number of threads in a block is available in the blockDim variable. In Figure 3.10, the value of the blockDim.x variable is 256. In general, the dimensions of thread blocks should be multiples of 32 due to hardware efficiency reasons. We will revisit this later.
Figure 3.10 All threads in a grid execute the same kernel code.
Each thread in a block has a unique threadIdx value. For example, the first thread in block 0 has the value 0 in its threadIdx.x variable, the second thread has the value 1, the third thread the value 2, and so on. This allows each thread to combine its threadIdx and blockIdx values to create a unique global index for itself within the entire grid. In Figure 3.10, a data index i is calculated as i=blockIdx.x*blockDim.x+threadIdx.x. Since blockDim is 256 in our example, the i values of threads in block 0 range from 0 to 255. The i values of threads in block 1 range from 256 to 511. The i values of threads in block 2 range from 512 to 767. That is, the i values of the threads in these three blocks form a continuous coverage of the values from 0 to 767. Since each thread uses i to access d_A, d_B, and d_C, these threads cover the first 768 iterations of the original loop. By launching the kernel with a larger number of blocks, one can process larger vectors. By launching a kernel with n or more threads, one can process vectors of length n.
Figure 3.11 shows a kernel function for vector addition. The syntax is ANSI C with some notable extensions. First, there is a CUDA-specific keyword __global__ in front of the declaration of vecAddKernel(). This keyword indicates that the function is a kernel and that it can be called from a host function to generate a grid of threads on a device.
Figure 3.11 A vector addition kernel function and its launch statement.
In general, CUDA extends C language with three qualifier keywords in function declarations. The meaning of these keywords is summarized in Figure 3.12. The __global__ keyword indicates that the function being declared is a CUDA kernel function. Note that there are two underscore characters on each side of the word “global.” A __global__ function is to be executed on the device and can only be called from the host code. The __device__ keyword indicates that the function being declared is a CUDA device function. A device function executes on a CUDA device and can only be called from a kernel function or another device function.8
Figure 3.12 CUDA C keywords for function declaration.
The __host__ keyword indicates that the function being declared is a CUDA host function. A host function is simply a traditional C function that executes on the host and can only be called from another host function. By default, all functions in a CUDA program are host functions if they do not have any of the CUDA keywords in their declaration. This makes sense since many CUDA applications are ported from CPU-only execution environments. The programmer would add kernel functions and device functions during the porting process; the original functions remain host functions. Having all functions default to host functions spares the programmer the tedious work of changing all the original function declarations.
Note that one can use both __host__ and __device__ in a function declaration. This combination tells the compilation system to generate two versions of object files for the same function. One is executed on the host and can only be called from a host function. The other is executed on the device and can only be called from a device or kernel function. This supports a common use case in which the same function source code can be recompiled to generate a device version. Many user library functions will likely fall into this category.
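For example, a small utility function could carry both qualifiers; the function below is a hypothetical illustration rather than one of the book's examples.
__host__ __device__ float square(float x)
{
    // Compiled twice: one version callable from host code, one callable from kernels and device functions
    return x * x;
}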
The second notable extension to ANSI C in Figure 3.10 is the keywords threadIdx.x, blockIdx.x, and blockDim.x. Note that all threads execute the same kernel code. There needs to be a way for them to distinguish among themselves and direct each thread toward a particular part of the data. These keywords identify predefined variables that correspond to hardware registers that provide the identifying coordinates to threads. Different threads will see different values in their threadIdx.x, blockIdx.x, and blockDim.x variables. For simplicity, we will refer to a thread as thread(blockIdx.x, threadIdx.x). Note that the .x implies that there might be .y and .z. We will come back to this point soon.
There is an automatic (local) variable i in Figure 3.11. In a CUDA kernel function, automatic variables are private to each thread. That is, a version of i will be generated for every thread. If the kernel is launched with 10,000 threads, there will be 10,000 versions of i, one for each thread. The value assigned by a thread to its i variable is not visible to other threads. We will discuss these automatic variables again in Chapter 5.
A quick comparison between Figure 3.4 and Figure 3.11 reveals an important insight for CUDA kernels and a CUDA kernel launch. The kernel function in Figure 3.11 does not have a loop that corresponds to the one in Figure 3.4. Readers should ask where the loop went. The answer is that the loop is now replaced with the grid of threads. The entire grid forms the equivalent of the loop. Each thread in the grid corresponds to one iteration of the original loop.
Note that there is an if (i<n) statement in vecAddKernel() in Figure 3.11. This is because not all vector lengths can be expressed as multiples of the block size. For example, if the vector length is 100, the smallest efficient thread block dimension is 32. Assume that we picked 32 as the block size. One would need to launch four thread blocks to process all 100 vector elements. However, the four thread blocks would have 128 threads. We need to disable the last 28 threads in thread block 3 from doing work not expected by the original program. Since all threads are to execute the same code, all will test their i values against n, which is 100. With the if (i<n) statement, the first 100 threads will perform the addition whereas the last 28 will not. This allows the kernel to process vectors of arbitrary lengths.
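Putting these pieces together, a sketch of the kernel consistent with this description looks as follows (Figure 3.11 itself is not reproduced here, and the parameter names are assumptions):
__global__ void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique global index for this thread
    if (i < n) C[i] = A[i] + B[i];                   // extra threads beyond n do nothing
}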
When the host code launches a kernel, it sets the grid and thread block dimensions via execution configuration parameters. This is illustrated in Figure 3.13. The configuration parameters are given between the <<< and >>> before the traditional C function arguments. The first configuration parameter gives the number of thread blocks in the grid. The second specifies the number of threads in each thread block. In this example, there are 256 threads in each block. To ensure that we have enough threads to cover all the vector elements, we apply the C ceiling function to n/256.0. Using the floating-point value 256.0 ensures that we generate a floating-point value for the division so that the ceiling function can round it up correctly. For example, if we have 1,000 vector elements, we would launch ceil(1,000/256.0)=4 thread blocks. As a result, the statement will launch 4×256=1,024 threads. With the if (i < n) statement in the kernel, as shown in Figure 3.11, the first 1,000 threads will perform addition on the 1,000 vector elements. The remaining 24 will not.
Figure 3.13 A vector addition kernel function and its launch statement.
Figure 3.14 shows the final host code of vecAdd(). This source code completes the skeleton in Figure 3.5. Figures 3.11 and 3.14 jointly illustrate a simple CUDA program that consists of both the host code and a device kernel. The code is hardwired to use thread blocks of 256 threads each. The number of thread blocks used, however, depends on the length of the vectors (n). If n is 750, three thread blocks will be used; if n is 4,000, 16 thread blocks will be used; if n is 2,000,000, 7,813 blocks will be used. Note that all the thread blocks operate on different parts of the vectors. They can be executed in any arbitrary order. A small GPU with a small amount of execution resources may execute one or two of these thread blocks in parallel. A larger GPU may execute 64 or 128 blocks in parallel. This gives CUDA kernels scalability in execution speed with respect to hardware. That is, the same code runs at lower performance on small GPUs and at higher performance on larger GPUs. We will revisit this point in Chapter 4.
Figure 3.14 A complete version of vecAdd().
It is important to point out that the vector addition example is used for its simplicity. In practice, the overhead of allocating device memory, input data transfer from host to device, output data transfer from device to host, and de-allocating device memory will likely make the resulting code slower than the original sequential code in Figure 3.4. This is because the amount of calculation done by the kernel is small relative to the amount of data processed. Only one addition is performed for two floating-point input operands and one floating-point output operand. Real applications typically have kernels where much more work is needed relative to the amount of data processed, which makes the additional overhead worthwhile. They also tend to keep the data in the device memory across multiple kernel invocations so that the overhead can be amortized. We will present several examples of such applications.
This chapter provided a quick overview of the CUDA programming model. CUDA extends the C language to support parallel computing. We discussed a subset of these extensions in this chapter. For your convenience, we summarize the extensions that we have discussed in this chapter as follows.
CUDA extends the C function declaration syntax to support heterogeneous parallel computing. The extensions are summarized in Figure 3.12. Using one of __global__, __device__, or __host__, a CUDA programmer can instruct the compiler to generate a kernel function, a device function, or a host function. A function declaration without any of these keywords defaults to a host function. If both __host__ and __device__ are used in a function declaration, the compiler generates two versions of the function, one for the device and one for the host.
CUDA extends C function call syntax with kernel execution configuration parameters surrounded by <<< and >>>. These execution configuration parameters are only used during a call to a kernel function, or a kernel launch. We discussed the execution configuration parameters that define the dimensions of the grid and the dimensions of each block. Readers should refer to the CUDA Programming Guide [NVIDIA2011] for more details of the kernel launch extensions as well as other types of execution configuration parameters.
CUDA kernels can access a set of predefined variables that allow threads to distinguish themselves from one another and to determine the area of data each thread is to work on. We discussed the threadIdx, blockDim, and blockIdx variables in this chapter. In Chapter 4, we will discuss more details of using these variables.
CUDA supports a set of API functions to provide services to CUDA programs. The services that we discussed in this chapter are the cudaMalloc(), cudaFree(), and cudaMemcpy() functions. These functions allocate device memory and transfer data between host and device on behalf of the calling program. Readers are referred to the CUDA Programming Guide [NVIDIA2011] for other CUDA API functions.
Our goal for this chapter is to introduce the core concepts of the CUDA programming model and the essential CUDA extensions to C for writing a simple CUDA program. The chapter is by no means a comprehensive account of all CUDA features. Some of these features will be covered in the remainder of the book; however, our emphasis will be on key concepts rather than details. We will introduce only the CUDA features that are needed in our code examples for parallel programming techniques. In general, we encourage readers to always consult the CUDA Programming Guide for more details of the CUDA features.
3.1. A matrix addition takes two input matrices B and C and produces one output matrix A. Each element of the output matrix A is the sum of the corresponding elements of the input matrices B and C, that is, A[i][j] = B[i][j] + C[i][j]. For simplicity, we will only handle square matrices whose elements are single-precision floating-point numbers. Write a matrix addition kernel and the host stub function that can be called with four parameters: pointer to the output matrix, pointer to the first input matrix, pointer to the second input matrix, and the number of elements in each dimension. Use the following instructions:
a. Write the host stub function by allocating memory for the input and output matrices, transferring input data to the device, launching the kernel, transferring the output data to the host, and freeing the device memory for the input and output data. Leave the execution configuration parameters open for this step.
b. Write a kernel that has each thread producing one output matrix element. Fill in the execution configuration parameters for the design.
c. Write a kernel that has each thread producing one output matrix row. Fill in the execution configuration parameters for the design.
d. Write a kernel that has each thread producing one output matrix column. Fill in the execution configuration parameters for the design.
e. Analyze the pros and cons of each preceding kernel design.
3.2. A matrix–vector multiplication takes an input matrix B and a vector C and produces one output vector A. Each element of the output vector A is the dot product of one row of the input matrix B and the vector C, that is, A[i] = ∑j B[i][j]*C[j]. For simplicity, we will only handle square matrices whose elements are single-precision floating-point numbers. Write a matrix–vector multiplication kernel and the host stub function that can be called with four parameters: pointer to the output vector, pointer to the input matrix, pointer to the input vector, and the number of elements in each dimension.
3.3. A new summer intern was frustrated with CUDA. He has been complaining that CUDA is very tedious: he had to declare many functions that he plans to execute on both the host and the device twice, once as a host function and once as a device function. What is your response?
3.4. Complete Parts 1 and 2 of the function in Figure 3.6.
3.5. If we need to use each thread to calculate one output element of a vector addition, what would be the expression for mapping the thread/block indices to data index:
(A) i=threadIdx.x+threadIdx.y;
(B) i=blockIdx.x+threadIdx.x;
(C) i=blockIdx.x*blockDim.x+threadIdx.x;
(D) i=blockIdx.x*threadIdx.x;
3.6. We want to use each thread to calculate two (adjacent) elements of a vector addition. Assume that variable i is the index of the first element to be processed by a thread. What would be the expression for mapping the thread/block indices to the data index?
(A) i=blockIdx.x*blockDim.x+threadIdx.x+2;
(B) i=blockIdx.x*threadIdx.x*2;
(C) i=(blockIdx.x*blockDim.x+threadIdx.x)*2;
(D) i=blockIdx.x*blockDim.x*2+threadIdx.x;
3.7. For a vector addition, assume that the vector length is 2000, each thread calculates one output element, and the thread block size is 512 threads. How many threads will be in the grid?
1. Atallah MJ, ed. Algorithms and Theory of Computation Handbook. Boca Raton, FL: CRC Press; 1998.
2. Flynn M. Some computer organizations and their effectiveness. IEEE Trans Comput. 1972;C-21:948.
3. NVIDIA Corporation, NVIDIA CUDA C Programming Guide, version 4.2, April 2012, Available at: <http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf>.
4. Patt YN, Patel SJ. Introduction to Computing Systems: From Bits and Gates to C and Beyond. New York: McGraw-Hill; 1972.
5. Stratton, J. A., Stone, S. S., & Hwu, W. W. MCUDA: An Efficient Implementation of CUDA Kernels for Multi-Core CPUs, The 21st International Workshop on Languages and Compilers for Parallel Computing, July 30–31, Canada, 2008. Also available as Lecture Notes in Computer Science, 2008.
1CUDA C also supports a growing subset of C++ features. Interested readers should refer to the CUDA Programming Guide for more information about the supported C++ features.
2There is a trend to integrate CPUs and GPUs into the same chip package, commonly referred to as fusion. Fusion architectures often have a unified memory space for host and devices. There are new programming frameworks, such as GMAC, that take advantage of the unified memory space and eliminate data copying costs.
3The fact that cudaMalloc() returns a generic object makes the use of dynamically allocated multidimensional arrays more complex. We will address this issue in Section 4.2.
4Note that cudaMalloc() has a different format from the C malloc() function. The C malloc() function returns a pointer to the allocated object. It takes only one parameter that specifies the size of the allocated object. The cudaMalloc() function writes to the pointer variable of which the address is given as the first parameter. As a result, the cudaMalloc() function takes two parameters. The two-parameter format of cudaMalloc() allows it to use the return value to report any errors in the same way as other CUDA API functions.
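A short usage sketch of this two-parameter form (illustration only, assuming an element count n):
float *d_A;
cudaError_t err = cudaMalloc((void**)&d_A, n * sizeof(float));  // the device address is written into d_A
if (err != cudaSuccess) { /* report and handle the allocation failure */ }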
5Please note cudaMemcpy() cannot be used to copy between different GPUs in multi-GPU systems.
6Note that SPMD is not the same as SIMD (single instruction, multiple data) [Flynn1972]. In an SPMD system, the parallel processing units execute the same program on multiple parts of the data. However, these processing units do not need to be executing the same instruction at the same time. In an SIMD system, all processing units are executing the same instruction at any instant.
7Each thread block can have up to 1,024 threads in CUDA 3.0 and later. Some earlier CUDA versions allow only up to 512 threads in a block.
8We will explain the rules for using indirect function calls and recursions in different generations of CUDA later. In general, one should avoid the use of recursion and indirect function calls in their device functions and kernel functions to allow maximal portability.
4.1 CUDA Thread Organization
4.2 Mapping Threads to Multidimensional Data
4.3 Matrix-Matrix Multiplication—A More Complex Kernel
4.4 Synchronization and Transparent Scalability
4.5 Assigning Resources to Blocks
4.6 Querying Device Properties
4.7 Thread Scheduling and Latency Tolerance
4.8 Summary
4.9 Exercises
Fine-grained, data-parallel threads are the fundamental means of parallel execution in CUDA. As we explained in Chapter 3, launching a CUDA kernel creates a grid of threads that all execute the kernel function. That is, the kernel function specifies the C statements that are executed by each individual thread at runtime. Each thread uses a unique coordinate, or thread index, to identify the portion of the data structure to process. The thread index can be organized multidimensionally to facilitate access to multidimensional arrays. This chapter presents more details on the organization, resource assignment, synchronization, and scheduling of threads in a grid. A CUDA programmer who understands these details is well equipped to express and understand the parallelism in high-performance CUDA applications.
Built-In Variables
Many programming languages have built-in variables. These variables have special meaning and purpose. The values of these variables are often preinitialized by the runtime system. For example, in a CUDA kernel function, gridDim, blockDim, blockIdx, and threadIdx are all built-in variables. Their values are preinitialized by the CUDA runtime systems and can be referenced in the kernel function. The programmers should refrain from using these variables for any other purpose.
Recall from Chapter 3 that all CUDA threads in a grid execute the same kernel function and they rely on coordinates to distinguish themselves from each other and to identify the appropriate portion of the data to process. These threads are organized into a two-level hierarchy: a grid consists of one or more blocks and each block in turn consists of one or more threads. All threads in a block share the same block index, which can be accessed as the blockIdx variable in a kernel. Each thread also has a thread index, which can be accessed as the threadIdx variable in a kernel. To a CUDA programmer, blockIdx and threadIdx appear as built-in, preinitialized variables that can be accessed within kernel functions (see “Built-in Variables” sidebar). When a thread executes a kernel function, references to the blockIdx and threadIdx variables return the coordinates of the thread. The execution configuration parameters in a kernel launch statement specify the dimensions of the grid and the dimensions of each block. These dimensions are available as predefined built-in variables gridDim and blockDim in kernel functions.
Hierarchical Organizations
Like CUDA threads, many real-world systems are organized hierarchically. The U.S. telephone system is a good example. At the top level, the telephone system consists of “areas,” each of which corresponds to a geographical area. All telephone lines within the same area have the same three-digit area code. A telephone area is typically larger than a city. For example, many counties and cities of central Illinois are within the same telephone area and share the same area code 217. Within an area, each phone line has a seven-digit local phone number, which allows each area to have a maximum of about 10 million numbers. One can think of each phone line as a CUDA thread, the area code as the CUDA blockIdx, and the seven-digit local number as the CUDA threadIdx. This hierarchical organization allows the system to have a very large number of phone lines while preserving “locality” for calling the same area. That is, when dialing a phone line in the same area, a caller only needs to dial the local number. As long as we make most of our calls within the local area, we do not need to dial the area code. If we occasionally need to call a phone line in another area, we dial 1 and the area code, followed by the local number (this is the reason why no local number in any area should start with a 1). The hierarchical organization of CUDA threads also offers a form of locality. We will study this locality soon.
In general, a grid is a 3D array of blocks1 and each block is a 3D array of threads. The programmer can choose to use fewer dimensions by setting the unused dimensions to 1. The exact organization of a grid is determined by the execution configuration parameters (within <<< and >>>) of the kernel launch statement. The first execution configuration parameter specifies the dimensions of the grid in number of blocks. The second specifies the dimensions of each block in number of threads. Each such parameter is of dim3 type, which is a C struct with three unsigned integer fields, x, y, and z. These three fields correspond to the three dimensions.
For 1D or 2D grids and blocks, the unused dimension fields should be set to 1 for clarity. For example, the following host code can be used to launch the vecAddKernel() kernel function and generate a 1D grid that consists of 128 blocks, each of which consists of 32 threads. The total number of threads in the grid is 128×32=4,096.
dim3 dimGrid(128, 1, 1);
dim3 dimBlock(32, 1, 1);
vecAddKernel<<<dimGrid, dimBlock>>>(…);
Note that dimGrid and dimBlock are host code variables defined by the programmer. These variables can have any legal C variable names as long as they are of type dim3 and the kernel launch uses the appropriate names. For example, the following statements accomplish the same as the previous statements:
dim3 dog(128, 1, 1);
dim3 cat(32, 1, 1);
vecAddKernel<<<dog, cat>>>(…);
The grid and block dimensions can also be calculated from other variables. For example, the kernel launch in Figure 3.14 can be written as:
dim3 dimGrid(ceil(n/256.0), 1, 1);
dim3 dimBlock(256, 1, 1);
vecAddKernel<<<dimGrid, dimBlock>>>(…);
This allows the number of blocks to vary with the size of the vectors so that the grid will have enough threads to cover all vector elements. The value of variable n at kernel launch time will determine the dimension of the grid. If n is equal to 1,000, the grid will consist of four blocks. If n is equal to 4,000, the grid will have 16 blocks. In each case, there will be enough threads to cover all the vector elements. Once vecAddKernel() is launched, the grid and block dimensions will remain the same until the entire grid finishes execution.
For convenience, CUDA C provides a special shortcut for launching a kernel with 1D grids and blocks. Instead of using dim3 variables, one can use arithmetic expressions to specify the configuration of 1D grids and blocks. In this case, the CUDA C compiler simply takes the arithmetic expression as the x dimensions and assumes that the y and z dimensions are 1. This gives us the kernel launch statement shown in Figure 3.14:
vecAddKernel<<<ceil(n/256.0), 256>>>(…);
Within the kernel function, the x fields of the predefined variables gridDim and blockDim are preinitialized according to the execution configuration parameters. For example, if n is equal to 4,000, references to gridDim.x and blockDim.x in the vecAddKernel kernel function will result in 16 and 256, respectively. Note that unlike the dim3 variables in the host code, the names of these variables within the kernel functions are part of the CUDA C specification and cannot be changed. That is, the gridDim and blockDim variables in the kernel function always reflect the dimensions of the grid and the blocks.
In CUDA C, the allowed values of gridDim.x, gridDim.y, and gridDim.z range from 1 to 65,536. All threads in a block share the same blockIdx.x, blockIdx.y, and blockIdx.z values. Among all blocks, the blockIdx.x value ranges between 0 and gridDim.x-1, the blockIdx.y value between 0 and gridDim.y-1, and the blockIdx.z value between 0 and gridDim.z-1. For the rest of this book, we will use the notation (x, y, z) for a 3D grid with x blocks in the x direction, y blocks in the y direction, and z blocks in the z direction.
We now turn our attention to the configuration of blocks. Blocks are organized into 3D arrays of threads. Two-dimensional blocks can be created by setting the z dimension to 1. One-dimensional blocks can be created by setting both the y and z dimensions to 1, as in the vecAddKernel example. As we mentioned before, all blocks in a grid have the same dimensions. The number of threads in each dimension of a block is specified by the second execution configuration parameter at the kernel launch. Within the kernel, this configuration parameter can be accessed as the x, y, and z fields of the predefined variable blockDim.
The total size of a block is limited to 1,024 threads, with flexibility in distributing these elements into the three dimensions as long as the total number of threads does not exceed 1,024. For example, (512, 1, 1), (8, 16, 4), and (32, 16, 2) are all allowable blockDim values, but (32, 32, 2) is not allowable since the total number of threads would exceed 1,024.2
Note that the grid can have higher dimensionality than its blocks and vice versa. For example, Figure 4.1 shows a small toy example of a 2D (2, 2, 1) grid that consists of 3D (4, 2, 2) blocks. The grid can be generated with the following host code:
Figure 4.1 A multidimensional example of CUDA grid organization.
dim3 dimGrid(2, 2, 1);
dim3 dimBlock(4, 2, 2);
KernelFunction<<<dimGrid, dimBlock>>>(…);
The grid consists of four blocks organized into a 2×2 array. Each block in Figure 4.1 is labeled with (blockIdx.y, blockIdx.x). For example, block(1,0) has blockIdx.y=1 and blockIdx.x=0. Note that the ordering of the labels is such that the highest dimension comes first. This is the reverse of the ordering used in the configuration parameters, where the lowest dimension comes first. This reversed ordering for labeling threads works better when we illustrate the mapping of thread coordinates into data indexes in accessing multidimensional arrays.
Each threadIdx also consists of three fields: the x coordinate threadIdx.x, the y coordinate threadIdx.y, and the z coordinate threadIdx.z. Figure 4.1 illustrates the organization of threads within a block. In this example, each block is organized into a 4×2×2 array of threads. Since all blocks within a grid have the same dimensions, we only need to show one of them. Figure 4.1 expands block(1,1) to show its 16 threads. For example, thread(1,0,2) has threadIdx.z=1, threadIdx.y=0, and threadIdx.x=2. Note that in this example, we have four blocks of 16 threads each, with a grand total of 64 threads in the grid. We use these small numbers to keep the illustration simple. Typical CUDA grids contain thousands to millions of threads.
The choice of 1D, 2D, or 3D thread organizations is usually based on the nature of the data. For example, pictures are a 2D array of pixels. It is often convenient to use a 2D grid that consists of 2D blocks to process the pixels in a picture. Figure 4.2 shows such an arrangement for processing a 76×62 picture (76 pixels in the horizontal or x direction and 62 pixels in the vertical or y direction). Assume that we decided to use a 16×16 block, with 16 threads in the x direction and 16 threads in the y direction. We will need five blocks in the x direction and four blocks in the y direction, which results in 5×4=20 blocks as shown in Figure 4.2. The heavy lines mark the block boundaries. The shaded area depicts the threads that cover pixels. Note that we have four extra threads in the x direction and two extra threads in the y direction. That is, we will generate 80×64 threads to process 76×62 pixels. This is similar to the situation where a 1,000-element vector is processed by the 1D vecAddKernel in Figure 3.10 using four 256-thread blocks. Recall that an if statement is needed to prevent the extra 24 threads from taking effect. Analogously, we should expect that the picture processing kernel function will have if statements to test whether the thread indices threadIdx.x and threadIdx.y fall within the valid range of pixels.
Figure 4.2 Using a 2D grid to process a picture.
Assume that the host code uses an integer variable n to track the number of pixels in the x direction, and another integer variable m to track the number of pixels in the y direction. We further assume that the input picture data has been copied to the device memory and can be accessed through a pointer variable d_Pin. The output picture has been allocated in the device memory and can be accessed through a pointer variable d_Pout. The following host code can be used to launch a 2D kernel to process the picture:
dim3 dimGrid(ceil(n/16.0), ceil(m/16.0), 1);
dim3 dimBlock(16, 16, 1);
pictureKernel<<<dimGrid, dimBlock>>>(d_Pin, d_Pout, n, m);
In this example, we assume for simplicity that the dimensions of the blocks are fixed at 16×16. The dimensions of the grid, on the other hand, depend on the dimensions of the picture. To process a 2,000×1,500 (3 M pixel) picture, we will generate 11,750 blocks, 125 in the x direction and 94 in the y direction. Within the kernel function, references to the built-in variables gridDim.x, gridDim.y, blockDim.x, and blockDim.y will result in 125, 94, 16, and 16, respectively.
Before we show the kernel code, we need to first understand how C statements access elements of dynamically allocated multidimensional arrays. Ideally, we would like to access d_Pin as a 2D array where an element at row j and column i can be accessed as d_Pin[j][i]. However, the ANSI C standard on which CUDA C is based requires that the number of columns in d_Pin be known at compile time. Unfortunately, this information is not known at compile time for dynamically allocated arrays. In fact, part of the reason why one uses dynamically allocated arrays is to allow the sizes and dimensions of these arrays to vary according to data size at runtime. Thus, the information on the number of columns in a dynamically allocated 2D array is not known at compile time by design. As a result, programmers need to explicitly linearize, or “flatten,” a dynamically allocated 2D array into an equivalent 1D array in the current CUDA C. Note that the newer C99 standard allows multidimensional syntax for dynamically allocated arrays. It is likely that future CUDA C versions may support multidimensional syntax for dynamically allocated arrays.
Memory Space
Memory space is a simplified view of how a processor accesses its memory in modern computers. A memory space is usually associated with each running application. The data to be processed by an application and instructions executed for the application are stored in locations in its memory space. Each location typically can accommodate a byte and has an address. Variables that require multiple bytes—4 bytes for float and 8 bytes for double—are stored in consecutive byte locations. The processor gives the starting address (address of the starting byte location) and the number of bytes needed when accessing a data value from the memory space.
The locations in a memory space are like phones in a telephone system where everyone has a unique phone number. Most modern computers have at least 4 GB-sized locations, where each G is 1,073,741,824 (2^30). All locations are labeled with an address that ranges from 0 to the largest number. Since there is only one address for every location, we say that the memory space has a “flat” organization. So, all multidimensional arrays are ultimately “flattened” into equivalent 1D arrays. While a C programmer can use a multidimensional syntax to access an element of a multidimensional array, the compiler translates these accesses into a base pointer that points to the beginning element of the array, along with an offset calculated from these multidimensional indices.
In reality, all multidimensional arrays in C are linearized. This is due to the use of a “flat” memory space in modern computers (see “Memory Space” sidebar). In the case of statically allocated arrays, the compilers allow the programmers to use higher-dimensional indexing syntax such as d_Pin[j][i] to access their elements. Under the hood, the compiler linearizes them into an equivalent 1D array and translates the multidimensional indexing syntax into a 1D offset. In the case of dynamically allocated arrays, the current CUDA C compiler leaves the work of such translation to the programmers due to lack of dimensional information.
There are at least two ways one can linearize a 2D array. One is to place all elements of the same row into consecutive locations. The rows are then placed one after another into the memory space. This arrangement, called row-major layout, is illustrated in Figure 4.3. To increase the readability, we will use Mj,i to denote an M element at the j row and the i column. Mj,i is equivalent to the C expression M[j][i] but slightly more readable. Figure 4.3 shows an example where a 4×4 matrix M is linearized into a 16-element 1D array, with all elements of row 0 first, followed by the four elements of row 1, etc. Therefore, the 1D equivalent index for the M element in row j and column i is j×4+i. The j×4 term skips over all elements of the rows before row j. The i term then selects the right element within the section for row j. For example, the 1D index for M2,1 is 2×4+1=9. This is illustrated in Figure 4.3, where M9 is the 1D equivalent to M2,1. This is the way C compilers linearize 2D arrays.
Figure 4.3 Row-major layout for a 2D C array. The result is an equivalent 1D array accessed by the index expression Row*Width+Col for an element that is in the Row-th row and Col-th column of an array with Width elements in each row.
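As a small illustration of row-major indexing (the names below are assumed), an element in row j and column i of a matrix with Width elements per row, stored in a flattened 1D array, is fetched as follows:
float getElement(const float* M_flat, int Width, int j, int i)
{
    // Row-major layout: row j starts at offset j*Width, and column i is i elements into that row
    return M_flat[j * Width + i];
}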
Another way to linearize a 2D array is to place all elements of the same column into consecutive locations. The columns are then placed one after another into the memory space. This arrangement, called column-major layout, is used by FORTRAN compilers. Note that the column-major layout of a 2D array is equivalent to the row-major layout of its transposed form. We will not spend more time on this except mentioning that readers whose primary previous programming experience was with FORTRAN should be aware that CUDA C uses row-major layout rather than column-major layout. Also, many C libraries that are designed to be used by FORTRAN programs use column-major layout to match the FORTRAN compiler layout. As a result, the manual pages for these libraries, such as Basic Linear Algebra Subprograms (see “Linear Algebra Functions” sidebar), usually tell the users to transpose the input arrays if they call these libraries from C programs.
We are now ready to study the source code of pictureKernel(), shown in Figure 4.4. Let's assume that the kernel will scale every pixel value in the picture by a factor of 2.0. The kernel code is conceptually quite simple. There are a total of blockDim.x*gridDim.x threads in the horizontal direction. As we learned in the vecAddKernel() example, the expression Col=blockIdx.x*blockDim.x+threadIdx.x generates every integer value from 0 to blockDim.x*gridDim.x–1. We know that gridDim.x*blockDim.x is greater than or equal to n, so we have at least as many threads as the number of pixels in the horizontal direction. Similarly, we also know that there are at least as many threads as the number of pixels in the vertical direction. Therefore, as long as we test that both the Row and Col values of a thread are within range, that is, (Col<n) && (Row<m), we will be able to cover every pixel in the picture. Since there are n pixels in a row, we can generate the 1D index for the pixel at row Row and column Col as Row*n+Col. This 1D index is used to read from the d_Pin array and to write the d_Pout array.
Figure 4.4 Source code of pictureKernel() showing a 2D thread mapping to a data pattern.
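Since the figure is not reproduced here, the following sketch is consistent with the description above; the parameter order follows the launch statement shown earlier, and the 2.0 scaling factor is the example assumed in the text.
__global__ void pictureKernel(float* d_Pin, float* d_Pout, int n, int m)
{
    // Column (x) and row (y) indices of the pixel covered by this thread
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    // Only threads that map to a valid pixel do any work
    if ((Col < n) && (Row < m))
        d_Pout[Row * n + Col] = 2.0f * d_Pin[Row * n + Col];
}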
Figure 4.5 illustrates the execution of pictureKernel() when processing our 76×62 example. Assuming that we use 16×16 blocks, launching pictureKernel() generates 80×64 threads. The grid will have 20 blocks, 5 in the horizontal direction and 4 in the vertical direction. During the execution, the execution behavior of blocks will fill into one of four different cases, shown as four major areas in Figure 4.5.
Figure 4.5 Covering a 76×62 picture with 16×16 blocks.
The first area, marked as 1 in Figure 4.5, consists of the threads that belong to the 12 blocks covering the majority of pixels in the picture. Both Col and Row values of these threads are within range; all these threads will pass the if statement test and process pixels in the dark-shaded area of the picture. That is, all 16×16=256 threads in each block will process pixels.
The second area, marked as 2 in Figure 4.5, contains the threads that belong to the 3 blocks in the medium-shaded area covering the upper-right pixels of the picture. Although the Row values of these threads are always within range, the Col values of some of them exceed the n value (76). This is because the number of threads in the horizontal direction is always a multiple of the blockDim.x value chosen by the programmer (16 in this case). The smallest multiple of 16 needed to cover 76 pixels is 80. As a result, 12 threads in each row will find their Col values within range and will process pixels. On the other hand, 4 threads in each row will find their Col values out of range, and thus fail the if statement condition. These threads will not process any pixels. Overall, 12×16=192 out of the 16×16=256 threads will process pixels.
The third area, marked as 3 in Figure 4.5, accounts for the 3 lower-left blocks covering the medium-shaded area of the picture. Although the Col values of these threads are always within range, the Row values of some of them exceed the m value (62). This is because the number of threads in the vertical direction is always a multiple of the blockDim.y value chosen by the programmer (16 in this case). The smallest multiple of 16 to cover 62 is 64. As a result, 14 threads in each column will find their Row values within range and will process pixels. On the other hand, 2 threads in each column will fail the if statement, as in area 2, and will not process any pixels; 16×14=224 out of the 256 threads will process pixels.
The fourth area, marked as 4 in Figure 4.5, contains the threads that cover the lower-right, light-shaded area of the picture. Similar to area 2, 4 threads in each of the top 14 rows will find their Col values out of range. Similar to area 3, the entire bottom two rows of this block will find their Row values out of range. As a result, only 14×12=168 of the 16×16=256 threads will be allowed to process pixels.
We can easily extend our discussion of 2D arrays to 3D arrays by including another dimension when we linearize arrays. This is done by placing each “plane” of the array one after another. Assume that the programmer uses variables m and n to track the number of rows and columns in a 3D array. The programmer also needs to determine the values of blockDim.z and gridDim.z when launching a kernel. In the kernel, the array index will involve another global index:
int Plane = blockIdx.z * blockDim.z + threadIdx.z;
The linearized access to an array P will be in the form of P[Plane*m*n + Row*n + Col]. One would of course need to test that all three global indices—Plane, Row, and Col—fall within the valid range of the array.
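To make this pattern concrete, the following is a minimal sketch of a kernel that uses the 3D indexing just described. The kernel name, the scaling operation, and the numPlanes bound are illustrative assumptions rather than code from a figure.

__global__ void scale3DKernel(float* P, int numPlanes, int m, int n)
{
    // Hypothetical kernel: numPlanes, m (rows), and n (columns)
    // describe the linearized 3D array P.
    int Plane = blockIdx.z * blockDim.z + threadIdx.z;
    int Row   = blockIdx.y * blockDim.y + threadIdx.y;
    int Col   = blockIdx.x * blockDim.x + threadIdx.x;
    // All three global indices must be within the valid range.
    if (Plane < numPlanes && Row < m && Col < n)
        P[Plane*m*n + Row*n + Col] *= 2.0f;  // linearized 3D access
}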
Linear Algebra Functions
Linear algebra operations are widely used in science and engineering applications. According to the widely used basic linear algebra subprograms (BLAS), a de facto standard for publishing libraries that perform basic algebra operations, there are three levels of linear algebra functions. As the level increases, the number of operations performed by the function increases. Level-1 functions perform vector operations of the form y = αx + y, where x and y are vectors and α is a scalar. Our vector addition example is a special case of a level-1 function with α=1. Level-2 functions perform matrix–vector operations of the form y = αAx + βy, where A is a matrix, x and y are vectors, and α and β are scalars. We will be studying a form of level-2 function in the context of sparse linear algebra. Level-3 functions perform matrix–matrix operations of the form C = αAB + βC, where A, B, and C are matrices and α and β are scalars. Our matrix–matrix multiplication example is a special case of a level-3 function where α=1 and β=0. These BLAS functions are important because they are used as basic building blocks of higher-level algebraic functions such as linear system solvers and eigenvalue analysis. As we will discuss later, the performance of different implementations of BLAS functions can vary by orders of magnitude in both sequential and parallel computers.
Up to this point, we have studied vecAddKernel() and pictureKernel(), where each thread performs only one floating-point arithmetic operation on one array element. Readers should ask the obvious question: Do all CUDA threads perform only such a trivial amount of work? The answer is no. Most real kernels have each thread perform many more arithmetic operations and embody sophisticated control flows. These two simple kernels were selected for teaching the mapping of threads to data using the threadIdx, blockIdx, blockDim, and gridDim variables. In particular, we introduced the following pattern of using global index values to ensure that every valid data element in a 2D array is covered by a unique thread:
Row = blockIdx.y * blockDim.y + threadIdx.y
and
Col = blockIdx.x * blockDim.x + threadIdx.x
We also used vecAddKernel() and pictureKernel() to introduce the phenomenon that the number of threads that we create is a multiple of the block dimension. As a result, we will likely end up with more threads than data elements. Not all threads will process elements of an array. We use an if statement to test if the global index values of a thread are within the valid range. Now that we understand the mapping of threads to data, we are in a position to understand kernels that perform more complex computation.
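For reference, a sketch of such a kernel, consistent with how pictureKernel() is described in this chapter, is shown below; the exact pixel operation (here, scaling each value by 2) is an assumption for illustration.

__global__ void pictureKernel(float* d_Pin, float* d_Pout, int n, int m)
{
    // One thread per pixel; the picture has m rows and n columns.
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    // Threads whose indices fall outside the picture do nothing.
    if ((Row < m) && (Col < n))
        d_Pout[Row*n + Col] = 2.0f * d_Pin[Row*n + Col];
}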
Matrix–matrix multiplication between an I×J matrix d_M and a J×K matrix d_N produces an I×K matrix d_P. Matrix–matrix multiplication is an important component of the BLAS standard (see “Linear Algebra Functions” sidebar). For simplicity, we will limit our discussion to square matrices, where I=J=K. We will use variable Width for I, J, and K.
When performing a matrix–matrix multiplication, each element of the product matrix d_P is an inner product of a row of d_M and a column of d_N. We will continue to use the convention where d_P(Row,Col) is the element at row Row and column Col. As shown in Figure 4.6, d_P(Row,Col) (the small square in d_P) is the inner product of row Row of d_M (shown as the horizontal strip in d_M) and column Col of d_N (shown as the vertical strip in d_N). The inner product between two vectors is the sum of products of corresponding elements. That is, d_P(Row,Col) = Σ d_M(Row,k) * d_N(k,Col), for k = 0, 1, …, Width-1. For example,
Figure 4.6 Matrix multiplication using multiple blocks by tiling d_P.
d_P(1,5) = d_M(1,0)*d_N(0,5) + d_M(1,1)*d_N(1,5) + d_M(1,2)*d_N(2,5) + … + d_M(1,Width-1)*d_N(Width-1,5)
We map threads to d_P elements with the same approach as we used for pictureKernel(). That is, each thread is responsible for calculating one d_P element. The d_P element calculated by a thread is in row blockIdx.y*blockDim.y+threadIdx.y and in column blockIdx.x*blockDim.x+threadIdx.x. Figure 4.7 shows the source code of the kernel based on this thread-to-data mapping. Readers should immediately see the familiar pattern of calculating Row and Col, and the if statement testing whether Row and Col are both within range. These statements are almost identical to their counterparts in pictureKernel(). The only significant difference is that we assume square matrices for matrixMulKernel(), so we replace both n and m with Width.
Figure 4.7 A simple matrix–matrix multiplication kernel using one thread to compute each d_P element.
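Since the code of Figure 4.7 is not reproduced in the text, the following sketch reconstructs the kernel from the surrounding description; treat it as an illustration of the figure rather than the figure itself.

__global__ void matrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        // Inner product of row Row of d_M and column Col of d_N.
        for (int k = 0; k < Width; ++k)
            Pvalue += d_M[Row*Width + k] * d_N[k*Width + Col];
        d_P[Row*Width + Col] = Pvalue;
    }
}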
With our thread-to-data mapping, we effectively divide d_P into square tiles, one of which is shown as a large square in Figure 4.6. Some block dimensions may work better on one device while others work better on another. This is why, in real applications, programmers often want to keep the block dimensions as an easily adjustable value in the host code. A common practice is to declare a compile-time constant and use this constant in the host statements for setting the kernel launch configuration. We will refer to this compile-time constant as BLOCK_WIDTH. To set BLOCK_WIDTH to a value, say 16, we can use the following C statement in a header file or at the beginning of a file where BLOCK_WIDTH is used:
#define BLOCK_WIDTH 16
Throughout the source code, the programmer can use the name BLOCK_WIDTH instead of a numerical value. Using a named compile-time constant allows the programmer to easily set BLOCK_WIDTH to a different value when compiling for particular hardware. It also allows an automated tuning system to search for the best BLOCK_WIDTH value by iteratively setting it to different values, compiling, and running for the hardware of interest. This type of process is often referred to as autotuning. In both cases, the source code can remain largely unchanged while the dimensions of the thread blocks are changed.
Figure 4.8 shows the host code used to launch matrixMulKernel(). Note that the configuration parameter dimGrid is set to ensure that for any combination of Width and BLOCK_WIDTH values, there are enough thread blocks in both the x and y dimensions to calculate all d_P elements. Also, the name of the BLOCK_WIDTH constant, rather than the actual value, is used in initializing the fields of dimGrid and dimBlock. This allows the programmer to easily change the BLOCK_WIDTH value without modifying any of the other statements. Assume that we have a Width value of 1,000. That is, we need to perform a 1,000×1,000 matrix–matrix multiplication. For a BLOCK_WIDTH value of 16, we will generate 16×16 blocks. There will be 63×63 blocks in the grid (the ceiling of 1,000/16 in each dimension) to cover all d_P elements. By changing the #define statement in Figure 4.8 to
Figure 4.8 Host code for launching the matrixMulKernel() using a compile-time constant BLOCK_WIDTH to set up its configuration parameters.
#define BLOCK_WIDTH 32
we will generate 32×32 blocks. There will be 32×32 blocks in the grid. We can make this change to the kernel launch configuration without changing any of the statements that initialize dimGrid and dimBlock.
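Figure 4.8 itself is not reproduced here; a minimal sketch of such host code, with the ceiling division written out, might look like the following. The wrapper function name launchMatrixMul is an illustrative assumption.

#define BLOCK_WIDTH 16   // easily changed to 32, as discussed above

void launchMatrixMul(float* d_M, float* d_N, float* d_P, int Width)
{
    // Ceiling division: enough blocks in each dimension to cover
    // all Width x Width elements of d_P.
    int numBlocks = (Width + BLOCK_WIDTH - 1) / BLOCK_WIDTH;
    dim3 dimGrid(numBlocks, numBlocks, 1);
    dim3 dimBlock(BLOCK_WIDTH, BLOCK_WIDTH, 1);
    matrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);
}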
We now turn our attention to the work done by each thread. Recall that d_P(Row,Col) is the inner product of row Row of d_M and column Col of d_N. In Figure 4.7, we use a for loop to perform this inner product operation. Before we enter the loop, we initialize a local variable Pvalue to 0. Each iteration of the loop accesses an element in row Row of d_M and an element in column Col of d_N, multiplies the two elements together, and accumulates the product into Pvalue.
Let’s first focus on accessing row Row of d_M within the for loop. Recall that d_M is linearized into an equivalent 1D array where the rows of d_M are placed one after another in the memory space, starting with row 0. Therefore, the beginning element of row 1 is d_M[1*Width], because we need to account for all elements of row 0. In general, the beginning element of row Row is d_M[Row*Width]. Since all elements of a row are placed in consecutive locations, element k of row Row is at d_M[Row*Width+k]. This is what we used in Figure 4.7.
We now turn to accessing column Col of d_N. As shown in Figure 4.3, the beginning element of column Col is the Col element of row 0, which is d_N[Col]. Accessing each additional element in column Col requires skipping over an entire row. This is because the next element of the same column is the same position in the next row. Therefore, element k of column Col is d_N[k*Width+Col].
After execution exits the for loop, each thread has its d_P element value in its Pvalue variable. It then uses the 1D equivalent index expression Row*Width+Col to write its d_P element. Again, this index pattern is similar to that used in pictureKernel(), with n replaced by Width.
We now use a small example to illustrate the execution of the matrix–matrix multiplication kernel. Figure 4.9 shows a 4×4 d_P with BLOCK_WIDTH = 2. The small sizes allow us to fit the entire example in one picture. The d_P matrix is now divided into four tiles and each block calculates one tile. (Whenever it is clear that we are discussing a device memory array, we will drop the d_ part of the name to improve readability. In Figure 4.10, we use P instead of d_P since it is clear that we are discussing a device memory array.) We do so by creating blocks that are 2×2 arrays of threads, with each thread calculating one P element. In the example, thread(0,0) of block(0,0) calculates P(0,0), whereas thread(0,0) of block(1,0) calculates P(2,0). It is easy to verify that one can identify the P element calculated by thread(0,0) of block(1,0) with the formula P(blockIdx.y*blockDim.y+threadIdx.y, blockIdx.x*blockDim.x+threadIdx.x) = P(1*2+0, 0*2+0) = P(2,0).
Figure 4.9 A small execution example of matrixMulKernel().
Figure 4.10 Matrix multiplication actions of one thread block. For readability, d_M, d_N, and d_P are shown as M, N, and P.
Readers should work through the index derivation for as many threads as it takes to become comfortable with the mapping.
Row and Col in matrixMulKernel() identify the P element to be calculated by a thread. Row also identifies the row of M and Col identifies the column of N as input values for the thread. Figure 4.10 illustrates the multiplication actions in each thread block. For the small matrix multiplication, threads in block(0,0) produce four dot products. The Row and Col variables of thread(0,0) in block(0,0) are 0*2+0=0 and 0*2+0=0. It maps to P(0,0) and calculates the dot product of row 0 of M and column 0 of N.
We now walk through the execution of the for loop of Figure 4.7 for thread(0,0) in block(0,0). During iteration 0 (k=0), Row*Width+k = 0*4+0 = 0 and k*Width+Col = 0*4+0 = 0. Therefore, we are accessing d_M[0] and d_N[0], which according to Figure 4.3 are the 1D equivalents of d_M(0,0) and d_N(0,0). Note that these are indeed element 0 of row 0 of d_M and element 0 of column 0 of d_N.
During the first iteration (k=1), Row*Width+k = 0*4+1 = 1 and k*Width+Col = 1*4+0 = 4. We are accessing d_M[1] and d_N[4], which according to Figure 4.3 are the 1D equivalents of d_M(0,1) and d_N(1,0). These are element 1 of row 0 of d_M and element 1 of column 0 of d_N.
During the second iteration (k=2), Row*Width+k = 0*4+2 = 2 and k*Width+Col = 2*4+0 = 8, which results in d_M[2] and d_N[8]. Therefore, the elements accessed are the 1D equivalents of d_M(0,2) and d_N(2,0).
Finally, during the third iteration (k=3), Row*Width+k = 0*4+3 = 3 and k*Width+Col = 3*4+0 = 12, which results in d_M[3] and d_N[12], the 1D equivalents of d_M(0,3) and d_N(3,0). We have now verified that the for loop performs the inner product between row 0 of d_M and column 0 of d_N. After the loop, the thread writes d_P[Row*Width+Col], which is d_P[0]. This is the 1D equivalent of d_P(0,0), so thread(0,0) in block(0,0) successfully calculates the inner product between row 0 of d_M and column 0 of d_N and deposits the result in d_P(0,0).
We will leave it as an exercise for the reader to hand-execute and verify the for loop for other threads in block(0,0) or in other blocks.
Note that matrixMulKernel() can handle matrices of up to 16×65,535 elements in each dimension. In situations where matrices larger than this limit are to be multiplied, one can divide the P matrix into submatrices whose sizes can be covered by a kernel launch. We can then either use the host code to iteratively launch kernels to complete the P matrix, or have each thread calculate more than one P element.
So far, we have discussed how to launch a kernel for execution by a grid of threads. We have also explained how one can map threads to parts of the data structure. However, we have not yet presented any means to coordinate the execution of multiple threads. We will now study a basic coordination mechanism. CUDA allows threads in the same block to coordinate their activities using a barrier synchronization function __syncthreads(). Note that “__” actually consists of two “_” characters. When a kernel function calls __syncthreads(), all threads in a block will be held at the calling location until every thread in the block reaches the location. This ensures that all threads in a block have completed a phase of their execution of the kernel before any of them can move on to the next phase. We will discuss an important use case of __syncthreads() in Chapter 5.
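As a simple illustration, the sketch below shows a kernel with two phases separated by a barrier; the kernel name, the averaging operation, and the fixed block size of 256 threads are assumptions for illustration.

__global__ void twoPhaseKernel(float* data, int n)
{
    __shared__ float buffer[256];   // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Phase 1: every thread loads one element into shared memory.
    buffer[threadIdx.x] = (i < n) ? data[i] : 0.0f;
    __syncthreads();                // barrier: all loads are complete
    // Phase 2: each thread can now safely read its neighbor's element.
    if (i < n && threadIdx.x > 0)
        data[i] = 0.5f * (buffer[threadIdx.x] + buffer[threadIdx.x - 1]);
}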
Barrier synchronization is a simple and popular method of coordinating parallel activities. In real life, we often use barrier synchronization to coordinate parallel activities of multiple persons. For example, assume that four friends go to a shopping mall in a car. They can all go to different stores to shop for their own clothes. This is a parallel activity and is much more efficient than if they all remain as a group and sequentially visit all the stores of interest. However, barrier synchronization is needed before they leave the mall. They have to wait until all four friends have returned to the car before they can leave—the ones who finish earlier need to wait for those who finish later. Without the barrier synchronization, one or more persons can be left in the mall when the car leaves, which can seriously damage their friendship!
Figure 4.11 illustrates the execution of barrier synchronization. There are N threads in the block. Time goes from left to right. Some of the threads reach the barrier synchronization statement early and some of them much later. The ones that reach the barrier early will wait for those that arrive late. When the latest one arrives at the barrier, everyone can continue their execution. With barrier synchronization, “No one is left behind.”
Figure 4.11 An example execution timing of barrier synchronization.
In CUDA, a __syncthreads() statement, if present, must be executed by all threads in a block. When a __syncthreads() statement is placed in an if statement, either all threads in a block execute the path that includes the __syncthreads() or none of them does. For an if-then-else statement, if each path has a __syncthreads() statement, either all threads in a block execute the __syncthreads() on the then path or all of them execute the one on the else path. The two __syncthreads() are different barrier synchronization points. If a thread in a block executes the then path and another executes the else path, they would be waiting at different barrier synchronization points. They would end up waiting for each other forever. It is the responsibility of the programmer to write code so that these requirements are satisfied.
The ability to synchronize also imposes execution constraints on threads within a block. These threads should execute in close time proximity with each other to avoid excessively long waiting times. In fact, one needs to make sure that all threads involved in the barrier synchronization have access to the necessary resources to eventually arrive at the barrier. Otherwise, a thread that never arrives at the barrier synchronization point could cause every other thread to wait forever. The CUDA runtime system satisfies this constraint by assigning execution resources to all threads in a block as a unit. A block can begin execution only when the runtime system has secured all the resources needed for all threads in the block to complete execution. When a thread of a block is assigned to an execution resource, all other threads in the same block are also assigned to the same resource. This ensures the time proximity of all threads in a block and prevents excessive or indefinite waiting time during barrier synchronization.
This leads us to a major trade-off in the design of CUDA barrier synchronization. By not allowing threads in different blocks to perform barrier synchronization with each other, the CUDA runtime system can execute blocks in any order relative to each other since none of them need to wait for each other. This flexibility enables scalable implementations as shown in Figure 4.12, where time progresses from top to bottom. In a low-cost system with only a few execution resources, one can execute a small number of blocks at the same time; two blocks executing at a time is shown on the left side of Figure 4.12. In a high-end implementation with more execution resources, one can execute a large number of blocks at the same time; four blocks executing at a time is shown on the right side of Figure 4.12.
Figure 4.12 Lack of synchronization constraints between blocks enables transparent scalability for CUDA programs.
The ability to execute the same application code at a wide range of speeds allows the production of a wide range of implementations according to the cost, power, and performance requirements of particular market segments. For example, a mobile processor may execute an application slowly but at extremely low power consumption, and a desktop processor may execute the same application at a higher speed while consuming more power. Both execute exactly the same application program with no change to the code. The ability to execute the same application code on hardware with a different number of execution resources is referred to as transparent scalability, which reduces the burden on application developers and improves the usability of applications.
Once a kernel is launched, the CUDA runtime system generates the corresponding grid of threads. As we discussed in the previous section, these threads are assigned to execution resources on a block-by-block basis. In the current generation of hardware, the execution resources are organized into streaming multiprocessors (SMs). Figure 4.13 illustrates that multiple thread blocks can be assigned to each SM. Each device has a limit on the number of blocks that can be assigned to each SM. For example, a CUDA device may allow up to eight blocks to be assigned to each SM. In situations where there is an insufficient amount of any one or more types of resources needed for the simultaneous execution of eight blocks, the CUDA runtime automatically reduces the number of blocks assigned to each SM until their combined resource usage falls under the limit. With a limited number of SMs and a limited number of blocks that can be assigned to each SM, there is a limit on the number of blocks that can be actively executing in a CUDA device. Most grids contain many more blocks than this number. The runtime system maintains a list of blocks that need to execute and assigns new blocks to SMs as they complete executing the blocks previously assigned to them.
Figure 4.13 Thread block assignment to SMs.
Figure 4.13 shows an example in which three thread blocks are assigned to each SM. One of the SM resource limitations is the number of threads that can be simultaneously tracked and scheduled. It takes hardware resources for SMs to maintain the thread and block indices and track their execution status. In more recent CUDA device designs, up to 1,536 threads can be assigned to each SM. This could be in the form of 6 blocks of 256 threads each, 3 blocks of 512 threads each, etc. If the device only allows up to 8 blocks in an SM, it should be obvious that 12 blocks of 128 threads each is not a viable option. If a CUDA device has 30 SMs and each SM can accommodate up to 1,536 threads, the device can have up to 46,080 threads simultaneously residing in the CUDA device for execution.
Our discussions on assigning execution resources to blocks raise an important question: How do we find out the amount of resources available? When a CUDA application executes on a system, how can it find out the number of SMs in a device and the number of threads that can be assigned to each SM? Obviously, there are also other resources that we have not discussed so far but can be relevant to the execution of a CUDA application. In general, many modern applications are designed to execute on a wide variety of hardware systems. There is often a need for the application to query the available resources and capabilities of the underlying hardware to take advantage of the more capable systems while compensating for the less capable systems.
Resource and Capability Queries
In everyday life, we often query resources and capabilities. For example, when we make a hotel reservation, we can check the amenities that come with a hotel room. If the room has a hair dryer, we do not need to bring one. Most American hotel rooms come with hair dryers, while many hotels in other regions do not have them.
Some Asian and European hotels provide toothpaste and even toothbrushes, while most American hotels do not. Many American hotels provide both shampoo and conditioner, while hotels on other continents often provide only shampoo.
If the room has a microwave oven and a refrigerator, we can take the leftovers from dinner and eat them the next day. If the hotel has a pool, we can bring swimsuits and take a dip after business meetings. If the hotel does not have a pool but has an exercise room, we can bring running shoes and exercise clothes. Some high-end Asian hotels even provide exercise clothing!
These hotel amenities are part of the properties, or resources and capabilities, of the hotels. Veteran travelers check these properties at hotel web sites, choose the hotels that better match their needs, and pack more efficiently and effectively using the information.
In CUDA C, there is a built-in mechanism for host code to query the properties of the devices available in the system. The CUDA runtime system has an API function cudaGetDeviceCount() that returns the number of available CUDA devices in the system. The host code can find out the number of available CUDA devices using the following statements:
int dev_count;
cudaGetDeviceCount(&dev_count);
While it may not be obvious, a modern PC system can easily have two or more CUDA devices. This is because many PC systems come with one or more “integrated” GPUs. These GPUs are the default graphics units and provide rudimentary capabilities and hardware resources to perform minimal graphics functionalities for modern window-based user interfaces. Most CUDA applications will not perform very well on these integrated devices. This would be a reason for the host code to iterate through all the available devices, query their resources and capabilities, and choose the ones that have enough resources to execute the application with satisfactory performance.
The CUDA runtime system numbers all the available devices in the system from 0 to dev_count-1. It provides an API function cudaGetDeviceProperties() that returns the properties of the device whose number is given as an argument. For example, we can use the following statements in the host code to iterate through the available devices and query their properties:
cudaDeviceProp dev_prop;
for (int i = 0; i < dev_count; i++) {
    cudaGetDeviceProperties(&dev_prop, i);
    // decide if the device has sufficient resources and capabilities
}
The built-in type cudaDeviceProp is a C structure with fields that represent the properties of a CUDA device. Readers are referred to the CUDA Programming Guide for all the fields of this type. We will discuss a few of the fields that are particularly relevant to the assignment of execution resources to threads. We assume that the properties are returned in the dev_prop variable, whose fields are set by the cudaGetDeviceProperties() function. If readers choose to name the variable differently, the appropriate variable name will obviously need to be substituted in the following discussion.
As the name suggests, the field dev_prop.maxThreadsPerBlock gives the maximal number of threads allowed in a block in the queried device. Some devices allow up to 1,024 threads in each block and other devices allow fewer. It is possible that future devices may even allow more than 1,024 threads per block. Therefore, it is a good idea to query the available devices and determine which ones will allow a sufficient number of threads in each block as far as the application is concerned.
The number of SMs in the device is given in dev_prop.multiProcessorCount. As we discussed earlier, some devices have only a small number of SMs (e.g., 2) and some have a much larger number of SMs (e.g., 30). If the application requires a large number of SMs to achieve satisfactory performance, it should definitely check this property of the prospective device. Furthermore, the clock frequency of the device is in dev_prop.clockRate. The combination of the clock rate and the number of SMs gives a good indication of the hardware execution capacity of the device.
The host code can find the maximal number of threads allowed along each dimension of a block in dev_prop.maxThreadsDim[0] (for the x dimension), dev_prop.maxThreadsDim[1] (for the y dimension), and dev_prop.maxThreadsDim[2] (for the z dimension). An example use of this information is for an automated tuning system to set the range of block dimensions when evaluating the best performing block dimensions for the underlying hardware. Similarly, it can find the maximal number of blocks allowed along each dimension of a grid in dev_prop.maxGridSize[0] (for the x dimension), dev_prop.maxGridSize[1] (for the y dimension), and dev_prop.maxGridSize[2] (for the z dimension). A typical use of this information is to determine whether a grid can have enough threads to handle the entire data set or if some kind of iteration is needed.
There are many more fields in the cudaDeviceProp structure type. We will discuss them as we introduce the concepts and features that they are designed to reflect.
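Putting the pieces together, a sketch of a complete query program that prints the properties discussed above follows; the particular set of printed fields is an illustrative choice.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int dev_count;
    cudaGetDeviceCount(&dev_count);
    for (int i = 0; i < dev_count; i++) {
        cudaDeviceProp dev_prop;
        cudaGetDeviceProperties(&dev_prop, i);
        printf("Device %d: %s\n", i, dev_prop.name);
        printf("  SMs: %d, clock rate: %d kHz\n",
               dev_prop.multiProcessorCount, dev_prop.clockRate);
        printf("  Max threads per block: %d, warp size: %d\n",
               dev_prop.maxThreadsPerBlock, dev_prop.warpSize);
        printf("  Max block dims: %d x %d x %d\n",
               dev_prop.maxThreadsDim[0], dev_prop.maxThreadsDim[1],
               dev_prop.maxThreadsDim[2]);
        printf("  Max grid dims: %d x %d x %d\n",
               dev_prop.maxGridSize[0], dev_prop.maxGridSize[1],
               dev_prop.maxGridSize[2]);
    }
    return 0;
}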
Thread scheduling is strictly an implementation concept and thus must be discussed in the context of specific hardware implementations. In most implementations to date, once a block is assigned to an SM, it is further divided into 32-thread units called warps. The size of warps is implementation-specific. In fact, warps are not part of the CUDA specification. However, knowledge of warps can be helpful in understanding and optimizing the performance of CUDA applications on particular generations of CUDA devices. The size of warps is a property of a CUDA device, which is in the dev_prop.warpSize field of the device query variable (dev_prop in this case).
The warp is the unit of thread scheduling in SMs. Figure 4.14 shows the division of blocks into warps in an implementation. Each warp consists of 32 threads of consecutive threadIdx values: threads 0–31 form the first warp, 32–63 the second warp, and so on. In this example, there are three blocks—block 1, block 2, and block 3, all assigned to an SM. Each of the three blocks is further divided into warps for scheduling purposes.
Figure 4.14 Blocks are partitioned into warps for thread scheduling.
We can calculate the number of warps that reside in an SM for a given block size and a given number of blocks assigned to each SM. For example, in Figure 4.14, if each block has 256 threads, we can determine that each block has 256÷32 or 8 warps. With three blocks in each SM, we have 8×3=24 warps in each SM.
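This arithmetic generalizes directly; the small helper below is a sketch for illustration, not library code.

// Warps per SM = (threads per block / warp size) x blocks per SM.
int warpsPerSM(int threadsPerBlock, int blocksPerSM, int warpSize)
{
    return (threadsPerBlock / warpSize) * blocksPerSM;
}
// Example from the text: warpsPerSM(256, 3, 32) == 24.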
An SM is designed to execute all threads in a warp following the single instruction, multiple data (SIMD) model. That is, at any instant in time, one instruction is fetched and executed for all threads in the warp. This is illustrated in Figure 4.14 with a single instruction fetch/dispatch shared among execution units in the SM. Note that these threads will apply the same instruction to different portions of the data. As a result, all threads in a warp will always have the same execution timing.
Figure 4.14 also shows a number of hardware streaming processors (SPs) that actually execute instructions. In general, there are fewer SPs than the number of threads assigned to each SM. That is, each SM has only enough hardware to execute instructions from a small subset of all the threads assigned to it at any point in time. In earlier GPU designs, each SM could execute only one instruction for a single warp at any given instant. In more recent designs, each SM can execute instructions for a small number of warps at any given point in time. In either case, the hardware can execute instructions for only a small subset of all the warps in the SM. A legitimate question is, why do we need so many warps in an SM if it can only execute a small subset of them at any instant? The answer is that this is how CUDA processors efficiently execute long-latency operations such as global memory accesses.
When an instruction executed by the threads in a warp needs to wait for the result of a previously initiated long-latency operation, the warp is not selected for execution. Another resident warp that is no longer waiting for results will be selected for execution. If more than one warp is ready for execution, a priority mechanism is used to select one for execution. This mechanism of filling the latency time of operations with work from other threads is often called latency tolerance or latency hiding (see “Latency Tolerance” sidebar).
Latency Tolerance
Latency tolerance is also needed in many everyday situations. For example, in post offices, each person trying to ship a package should ideally have filled out all the forms and labels before going to the service counter. However, as we have all experienced, many people wait for the service desk clerk to tell them which form to fill out and how to fill out the form.
When there is a long line in front of the service desk, it is important to maximize the productivity of the service clerks. Letting a person fill out the form in front of the clerk while everyone waits is not a good approach. The clerk should be helping the next customers who are waiting in line while the person fills out the form. These other customers are “ready to go” and should not be blocked by the customer who needs more time to fill out a form.
This is why a good clerk would politely ask the first customer to step aside to fill out the form while he or she can serve other customers. In most cases, the first customer will be served as soon as he or she finishes the form and the clerk finishes serving the current customer, instead of going to the end of the line.
We can think of these post office customers as warps and the clerk as a hardware execution unit. The customer who needs to fill out the form corresponds to a warp whose continued execution depends on a long-latency operation.
Note that warp scheduling is also used for tolerating other types of operation latencies, such as pipelined floating-point arithmetic and branch instructions. With enough warps around, the hardware will likely find a warp to execute at any point in time, thus making full use of the execution hardware in spite of these long-latency operations. The selection of ready warps for execution does not introduce any idle time into the execution timeline, which is referred to as zero-overhead thread scheduling. With warp scheduling, the long waiting time of warp instructions is “hidden” by executing instructions from other warps. This ability to tolerate long operation latencies is the main reason why GPUs do not dedicate nearly as much chip area to cache memories and branch prediction mechanisms as CPUs do. As a result, GPUs can dedicate more of their chip area to floating-point execution resources.
We are now ready to do a simple exercise.3 Assume that a CUDA device allows up to 8 blocks and 1,024 threads per SM, whichever becomes a limitation first. Furthermore, it allows up to 512 threads in each block. For matrix–matrix multiplication, should we use 8×8, 16×16, or 32×32 thread blocks? To answer the question, we can analyze the pros and cons of each choice. If we use 8×8 blocks, each block would have only 64 threads. We will need 1,024÷64=12 blocks to fully occupy an SM. However, since there is a limitation of up to 8 blocks in each SM, we will end up with only 64×8=512 threads in each SM. This means that the SM execution resources will likely be underutilized because there will be fewer warps to schedule around long-latency operations.
The 16×16 blocks give 256 threads per block. This means that each SM can take 1,024÷256=4 blocks. This is within the 8-block limitation. This is a good configuration since we will have full thread capacity in each SM and a maximal number of warps for scheduling around the long-latency operations. The 32×32 blocks would give 1,024 threads in each block, exceeding the limit of 512 threads per block for this device.
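The reasoning above can be captured in a small helper; the sketch below hard-codes the stated device limits (8 blocks and 1,024 threads per SM) purely for illustration.

// Threads resident per SM, taking whichever limit binds first.
int threadsPerSM(int threadsPerBlock)
{
    const int maxBlocksPerSM  = 8;     // assumed device limit
    const int maxThreadsPerSM = 1024;  // assumed device limit
    int blocks = maxThreadsPerSM / threadsPerBlock;
    if (blocks > maxBlocksPerSM) blocks = maxBlocksPerSM;
    return blocks * threadsPerBlock;
}
// threadsPerSM(64)  == 512   (8x8 blocks: capped by the 8-block limit)
// threadsPerSM(256) == 1024  (16x16 blocks: full thread capacity)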
The kernel execution configuration defines the dimensions of a grid and its blocks. Unique coordinates in blockIdx and threadIdx variables allow threads of a grid to identify themselves and their domains of data. It is the programmer’s responsibility to use these variables in kernel functions so that the threads can properly identify the portion of the data to process. This model of programming compels the programmer to organize threads and their data into hierarchical and multidimensional organizations.
Once a grid is launched, its blocks are assigned to SMs in arbitrary order, resulting in the transparent scalability of CUDA applications. Transparent scalability comes with a limitation: threads in different blocks cannot synchronize with each other. If threads in different blocks need to synchronize, a simple way to do so while keeping the kernel transparently scalable is to terminate the kernel and start a new kernel for the activities after the synchronization point.
Threads are assigned to SMs for execution on a block-by-block basis. Each CUDA device imposes a potentially different limitation on the amount of resources available in each SM. For example, each CUDA device has a limit on the number of thread blocks and the number of threads each of its SMs can accommodate, whichever becomes a limitation first. For each kernel, one or more of these resource limitations can become the limiting factor for the number of threads that simultaneously reside in a CUDA device.
Once a block is assigned to an SM, it is further partitioned into warps. All threads in a warp have identical execution timing. At any time, the SM executes instructions of only a small subset of its resident warps. This allows the other warps to wait for long-latency operations without slowing down the overall execution throughput of the massive number of execution units.
4.1 If a CUDA device’s SM can take up to 1,536 threads and up to 4 thread blocks, which of the following block configurations would result in the largest number of threads in the SM?
4.2 For a vector addition, assume that the vector length is 2,000, each thread calculates one output element, and the thread block size is 512 threads. How many threads will be in the grid?
4.3 For the previous question, how many warps do you expect to have divergence due to the boundary check on the vector length?
4.4 You need to write a kernel that operates on an image of size 400×900 pixels. You would like to assign one thread to each pixel. You would like your thread blocks to be square and to use the maximum number of threads per block possible on the device (your device has compute capability 3.0). How would you select the grid dimensions and block dimensions of your kernel?
4.5 For the previous question, how many idle threads do you expect to have?
4.6 Consider a hypothetical block with 8 threads executing a section of code before reaching a barrier. The threads require the following amount of time (in microseconds) to execute the sections: 2.0, 2.3, 3.0, 2.8, 2.4, 1.9, 2.6, 2.9, and spend the rest of their time waiting for the barrier. What percentage of the threads’ summed-up execution times is spent waiting for the barrier?
4.7 Indicate which of the following assignments per multiprocessor is possible. In the case where it is not possible, indicate the limiting factor(s).
a. 8 blocks with 128 threads each on a device with compute capability 1.0
b. 8 blocks with 128 threads each on a device with compute capability 1.2
c. 8 blocks with 128 threads each on a device with compute capability 3.0
d. 16 blocks with 64 threads each on a device with compute capability 1.0
e. 16 blocks with 64 threads each on a device with compute capability 1.2
f. 16 blocks with 64 threads each on a device with compute capability 3.0
4.8 A CUDA programmer says that if they launch a kernel with only 32 threads in each block, they can leave out the __syncthreads() instruction wherever barrier synchronization is needed. Do you think this is a good idea? Explain.
4.9 A student mentioned that he was able to multiply two 1,024×1,024 matrices using a tiled matrix multiplication code with 32×32 thread blocks. He is using a CUDA device that allows up to 512 threads per block and up to 8 blocks per SM. He further mentioned that each thread in a thread block calculates one element of the result matrix. What would be your reaction and why?
4.10 The following kernel is executed on a large matrix, which is tiled into submatrices. To manipulate tiles, a new CUDA programmer has written the following device kernel, which will transpose each tile in the matrix. The tiles are of size BLOCK_WIDTH by BLOCK_WIDTH, and each of the dimensions of matrix A is known to be a multiple of BLOCK_WIDTH. The kernel invocation and code are shown below. BLOCK_WIDTH is known at compile time, but could be set anywhere from 1 to 20.
dim3 blockDim(BLOCK_WIDTH, BLOCK_WIDTH);
dim3 gridDim(A_width/blockDim.x, A_height/blockDim.y);
BlockTranspose<<<gridDim, blockDim>>>(A, A_width, A_height);

__global__ void
BlockTranspose(float* A_elements, int A_width, int A_height)
{
    __shared__ float blockA[BLOCK_WIDTH][BLOCK_WIDTH];

    int baseIdx = blockIdx.x * BLOCK_WIDTH + threadIdx.x;
    baseIdx += (blockIdx.y * BLOCK_WIDTH + threadIdx.y) * A_width;

    blockA[threadIdx.y][threadIdx.x] = A_elements[baseIdx];
    A_elements[baseIdx] = blockA[threadIdx.x][threadIdx.y];
}
a. Out of the possible range of values for BLOCK_WIDTH, for what values of BLOCK_WIDTH will this kernel function correctly when executing on the device?
b. If the code does not execute correctly for all BLOCK_WIDTH values, suggest a fix to the code to make it work for all BLOCK_WIDTH values.
1. Devices with compute capability less than 2.0 support grids with up to 2D arrays of blocks.
2. Devices with compute capability less than 2.0 allow blocks with up to 512 threads.
3. Note that this is an oversimplified exercise. As we will explain in Chapter 5, the usage of other resources, such as registers and shared memory, must also be considered when determining the most appropriate block dimensions. This exercise highlights the interactions between the limit on the number of thread blocks and the limit on the number of threads that can be assigned to each SM.
5.1 Importance of Memory Access Efficiency
5.2 CUDA Device Memory Types
5.3 A Strategy for Reducing Global Memory Traffic
5.4 A Tiled Matrix–Matrix Multiplication Kernel
5.5 Memory as a Limiting Factor to Parallelism
5.6 Summary
5.7 Exercises
So far, we have learned to write a CUDA kernel function that is executed by a massive number of threads. The data to be processed by these threads is first transferred from the host memory to the device global memory. The threads then access their portion of the data from the global memory using their block IDs and thread IDs. We have also learned more details of the assignment and scheduling of threads for execution. Although this is a very good start, these simple CUDA kernels will likely achieve only a small fraction of the potential speed of the underlying hardware. The poor performance is due to the fact that global memory, which is typically implemented with dynamic random access memory (DRAM), tends to have long access latencies (hundreds of clock cycles) and finite access bandwidth. While having many threads available for execution can theoretically tolerate long memory access latencies, one can easily run into a situation where traffic congestion in the global memory access paths prevents all but very few threads from making progress, thus rendering some of the streaming multiprocessors (SMs) idle. To circumvent such congestion, CUDA provides a number of additional methods for accessing memory that can remove the majority of data requests to the global memory. In this chapter, you will learn to use these memories to boost the execution efficiency of CUDA kernels.
We can illustrate the effect of memory access efficiency by calculating the expected performance level of the matrix multiplication kernel code in Figure 4.7, replicated in Figure 5.1. The most important part of the kernel in terms of execution time is the for loop that performs inner product calculation.
Figure 5.1 A simple matrix–matrix multiplication kernel using one thread to compute each d_P element (copied from Figure 4.7).
for (int k = 0; k < Width; ++k)
    Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
In every iteration of this loop, two global memory accesses are performed for one floating-point multiplication and one floating-point addition. One global memory access fetches a d_M[] element and the other fetches a d_N[] element. One floating-point operation multiplies the d_M[] and d_N[] elements fetched and the other accumulates the product into Pvalue. Thus, the ratio of floating-point calculation to global memory access operation is 1:1, or 1.0. We will refer to this ratio as the compute to global memory access (CGMA) ratio, defined as the number of floating-point calculations performed for each access to the global memory within a region of a CUDA program.
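To make the counting explicit, the loop body can be annotated with its operations; this restates the code above rather than adding anything new:

for (int k = 0; k < Width; ++k)
    // two global memory accesses: one d_M element, one d_N element
    // two floating-point operations: one multiply, one add
    // CGMA = 2 FLOPs / 2 accesses = 1.0
    Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];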
CGMA has major implications on the performance of a CUDA kernel. In a high-end device today, the global memory bandwidth is around 200 GB/s. With 4 bytes in each single-precision floating-point value, one can expect to load no more than 50 (200/4) giga single-precision operands per second. With a CGMA ratio of 1.0, the matrix multiplication kernel will execute no more than 50 giga floating-point operations per second (GFLOPS). While 50 GFLOPS is a respectable number, it is only a tiny fraction of the peak single-precision performance of 1,500 GFLOPS or higher for these high-end devices. We need to increase the CGMA ratio to achieve a higher level of performance for the kernel. For the matrix multiplication code to achieve the peak 1,500 GFLOPS rating of the processor, we need a CGMA value of 30. The desired CGMA ratio has roughly doubled in the past three generations of devices.
The von Neumann Model
In his seminal 1945 report, John von Neumann described a model for building electronic computers that is based on the design of the pioneering EDVAC computer. This model, now commonly referred to as the von Neumann model, has been the foundational blueprint for virtually all modern computers.
The von Neumann model is illustrated here. The computer has an I/O that allows both programs and data to be provided to and generated from the system. To execute a program, the computer first inputs the program and its data into the memory.
The program consists of a collection of instructions. The control unit maintains a program counter (PC), which contains the memory address of the next instruction to be executed. In each “instruction cycle,” the control unit uses the PC to fetch an instruction into the instruction register (IR). The instruction bits are then used to determine the action to be taken by all components of the computer. This is the reason why the model is also called the “stored program” model, which means that a user can change the actions of a computer by storing a different program into its memory.
CUDA supports several types of memory that can be used by programmers to achieve a high CGMA ratio and thus a high execution speed in their kernels. Figure 5.2 shows these CUDA device memories. At the bottom of the figure, we see global memory and constant memory. These types of memory can be written (W) and read (R) by the host by calling API functions.1 We have already introduced global memory in Chapter 3. The constant memory supports short-latency, high-bandwidth, read-only access by the device when all threads simultaneously access the same location.
Figure 5.2 Overview of the CUDA device memory model.
Registers and shared memory in Figure 5.2 are on-chip memories. Variables that reside in these types of memory can be accessed at very high speed in a highly parallel manner. Registers are allocated to individual threads; each thread can only access its own registers. A kernel function typically uses registers to hold frequently accessed variables that are private to each thread. Shared memory is allocated to thread blocks; all threads in a block can access variables in the shared memory locations allocated to the block. Shared memory is an efficient means for threads to cooperate by sharing their input data and the intermediate results of their work. By declaring a CUDA variable in one of the CUDA memory types, a CUDA programmer dictates the visibility and access speed of the variable.
To fully appreciate the difference between registers, shared memory, and global memory, we need to go into a little more detail of how these different types of memories are realized and used in modern processors. The global memory in the CUDA programming model maps to the memory of the von Neumann model (see “The von Neumann Model” sidebar). The processor box in Figure 5.3 corresponds to the processor chip boundary that we typically see today. The global memory is off the processor chip and is implemented with DRAM technology, which implies long access latencies and relatively low access bandwidth. The registers correspond to the “register file” of the von Neumann model. It is on the processor chip, which implies very short access latency and drastically higher access bandwidth. In a typical device, the aggregated access bandwidth of the register files is about two orders of magnitude higher than that of the global memory. Furthermore, whenever a variable is stored in a register, its accesses no longer consume off-chip global memory bandwidth. This will be reflected as an increase in the CGMA ratio.
Figure 5.3 Memory versus registers in a modern computer based on the von Neumann model.
A more subtle point is that each access to registers involves fewer instructions than global memory. In Figure 5.3, the processor uses the PC value to fetch instructions from memory into the IR (see “The von Neumann Model” sidebar). The bits of the fetched instructions are then used to control the activities of the components of the computer. Using the instruction bits to control the activities of the computer is referred to as instruction execution. The number of instructions that can be fetched and executed in each clock cycle is limited. Therefore, the more instructions that need to be executed for a program, the more time it can take to execute the program.
Arithmetic instructions in most modern processors have “built-in” register operands. For example, a typical floating-point addition instruction is of the form
fadd r1, r2, r3
where r2 and r3 are the register numbers that specify the location in the register file where the input operand values can be found. The location for storing the floating-point addition result value is specified by r1. Therefore, when an operand of an arithmetic instruction is in a register, there is no additional instruction required to make the operand value available to the arithmetic and logic unit (ALU) where the arithmetic calculation is done.
On the other hand, if an operand value is in global memory, one needs to perform a memory load operation to make the operand value available to the ALU. For example, if the first operand of a floating-point addition instruction is in global memory of a typical computer today, the instructions involved will likely be
load r2, r4, offset
fadd r1, r2, r3
where the load instruction adds an offset value to the contents of r4 to form an address for the operand value. It then accesses the global memory and places the value into register r2. The fadd instruction then performs the floating-point addition using the values in r2 and r3 and places the result into r1. Since the processor can only fetch and execute a limited number of instructions per clock cycle, the version with an additional load will likely take more time to process than the one without. This is another reason why placing the operands in registers can improve execution speed.
Processing Units and Threads
Now that we have introduced the von Neumann model, we are ready to discuss how threads are implemented. A thread in modern computers is a virtualized von Neumann processor. Recall that a thread consists of the code of a program, the particular point in the code that is being executed, and the value of its variables and data structures.
In a computer based on the von Neumann model, the code of the program is stored in the memory. The PC keeps track of the particular point of the program that is being executed. The IR holds the instruction that is fetched from that point of execution. The registers and memory hold the values of the variables and data structures.
Modern processors are designed to allow context switching, where multiple threads can timeshare a processor by taking turns to make progress. By carefully saving and restoring the PC value and the contents of registers and memory, we can suspend the execution of a thread and correctly resume the execution of the thread later.
Some processors provide multiple processing units, which allow multiple threads to make simultaneous progress. Figure 5.4 shows a single instruction, multiple data (SIMD) design style where all processing units share a PC and IR. Under this design, all threads making simultaneous progress execute the same instruction in the program.
Figure 5.4 Shared memory versus registers in a CUDA device SM.
Finally, there is another subtle reason why placing an operand value in registers is preferable. In modern computers, the energy consumed for accessing a value from the register file is at least an order of magnitude lower than for accessing a value from the global memory. We will look at more details of the speed and energy difference in accessing these two hardware structures in modern computers soon. However, as we will soon learn, the number of registers available to each thread is quite limited in today’s GPUs. We need to be careful not to oversubscribe this limited resource.
Figure 5.4 shows shared memory and registers in a CUDA device. Although both are on-chip memories, they differ significantly in functionality and cost of access. Shared memory is designed as part of the memory space that resides on the processor chip (see Section 4.2). When the processor accesses data that resides in the shared memory, it needs to perform a memory load operation, just like accessing data in the global memory. However, because shared memory resides on-chip, it can be accessed with much lower latency and much higher bandwidth than the global memory. Because of the need to perform a load operation, shared memory has longer latency and lower bandwidth than registers. In computer architecture, shared memory is a form of scratchpad memory.
One important difference between shared memory and registers in CUDA is that variables that reside in the shared memory are accessible by all threads in a block. This is in contrast to register data, which is private to a thread. That is, shared memory is designed to support efficient, high-bandwidth sharing of data among threads in a block. As shown in Figure 5.4, a CUDA device SM typically employs multiple processing units, referred to as SPs in Figure 4.14, to allow multiple threads to make simultaneous progress (see “Processing Units and Threads” sidebar). Threads in a block can be spread across these processing units. Therefore, the hardware implementations of shared memory in these CUDA devices are typically designed to allow multiple processing units to simultaneously access its contents to support efficient data sharing among threads in a block. We will be learning several important types of parallel algorithms that can greatly benefit from such efficient data sharing among threads.
It should be clear by now that registers, shared memory, and global memory all have different functionalities, latencies, and bandwidth. It is, therefore, important to understand how to declare a variable so that it will reside in the intended type of memory. Table 5.1 presents the CUDA syntax for declaring program variables into the various types of device memory. Each such declaration also gives its declared CUDA variable a scope and lifetime. Scope identifies the range of threads that can access the variable: by a single thread only, by all threads of a block, or by all threads of all grids. If a variable’s scope is a single thread, a private version of the variable will be created for every thread; each thread can only access its private version of the variable. For example, if a kernel declares a variable of which the scope is a thread and it is launched with one million threads, one million versions of the variable will be created so that each thread initializes and uses its own version of the variable.
Table 5.1 CUDA Variable Type Qualifiers
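Its contents, reconstructed here from the discussion that follows, are:

Variable declaration                        Memory     Scope      Lifetime
Automatic variables other than arrays       Register   Thread     Kernel
Automatic array variables                   Global     Thread     Kernel
__device__ __shared__ int SharedVar;        Shared     Block      Kernel
__device__ int GlobalVar;                   Global     All grids  Application
__device__ __constant__ int ConstantVar;    Constant   All grids  Application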
Lifetime tells the portion of the program’s execution duration when the variable is available for use: either within a kernel’s execution or throughout the entire application. If a variable’s lifetime is within a kernel’s execution, it must be declared within the kernel function body and will be available for use only by the kernel’s code. If the kernel is invoked several times, the value of the variable is not maintained across these invocations. Each invocation must initialize the variables before using them. On the other hand, if a variable’s lifetime is throughout the entire application, it must be declared outside of any function body. The contents of these variables are maintained throughout the execution of the application and available to all kernels.
As shown in Table 5.1, all automatic scalar variables declared in kernel and device functions are placed into registers. We refer to variables that are not arrays as scalar variables. The scopes of these automatic variables are within individual threads. When a kernel function declares an automatic variable, a private copy of that variable is generated for every thread that executes the kernel function. When a thread terminates, all its automatic variables also cease to exist. In Figure 5.1, variables Row, Col, and Pvalue are all automatic variables and fall into this category. Note that accessing these variables is extremely fast and parallel but one must be careful not to exceed the limited capacity of the register storage in the hardware implementations. We will address this point in Chapter 6.
Automatic array variables are not stored in registers.2 Instead, they are stored into the global memory and may incur long access delays and potential access congestions. The scope of these arrays is, like automatic scalar variables, limited to individual threads. That is, a private version of each automatic array is created for and used by every thread. Once a thread terminates its execution, the contents of its automatic array variables also cease to exist. From our experience, one seldom needs to use automatic array variables in kernel functions and device functions.
If a variable declaration is preceded by the keyword __shared__ (each __ consists of two _ characters), it declares a shared variable in CUDA. One can also add an optional __device__ in front of __shared__ in the declaration to achieve the same effect. Such a declaration typically resides within a kernel function or a device function. Shared variables reside in shared memory. The scope of a shared variable is within a thread block, that is, all threads in a block see the same version of a shared variable. A private version of the shared variable is created for and used by each thread block during kernel execution. The lifetime of a shared variable is within the duration of the kernel. When a kernel terminates its execution, the contents of its shared variables cease to exist. As we discussed earlier, shared variables are an efficient means for threads within a block to collaborate with each other. Accessing shared variables from the shared memory is extremely fast and highly parallel. CUDA programmers often use shared variables to hold the portion of global memory data that is heavily used in an execution phase of a kernel. One may need to adjust the algorithms used to create execution phases that heavily focus on small portions of the global memory data, as we will demonstrate with matrix multiplication in Section 5.3.
If a variable declaration is preceded by the keyword __constant__ (each __ consists of two _ characters), it declares a constant variable in CUDA. One can also add an optional __device__ in front of __constant__ to achieve the same effect. Declaration of constant variables must be outside any function body. The scope of a constant variable is all grids, meaning that all threads in all grids see the same version of a constant variable. The lifetime of a constant variable is the entire application execution. Constant variables are often used for variables that provide input values to kernel functions. Constant variables are stored in the global memory but are cached for efficient access. With appropriate access patterns, accessing constant memory is extremely fast and parallel. Currently, the total size of constant variables in an application is limited to 65,536 bytes. One may need to break up the input data volume to fit within this limitation, as we will illustrate in Chapter 8.
A variable of which the declaration is preceded only by the keyword __device__ (each __ consists of two _ characters) is a global variable and will be placed in the global memory. Accesses to a global variable are slow. Latency and throughput of accessing global variables have been improved with caches in more recent devices. One important advantage of global variables is that they are visible to all threads of all kernels. Their contents also persist through the entire execution. Thus, global variables can be used as a means for threads to collaborate across blocks. One must, however, be aware of the fact that there is currently no easy way to synchronize between threads from different thread blocks or to ensure data consistency across threads when accessing global memory other than terminating the current kernel execution.3 Therefore, global variables are often used to pass information from one kernel invocation to another kernel invocation.
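A minimal sketch tying these qualifiers together (the kernel and variable names are hypothetical; the host would initialize ConstCoeffs with cudaMemcpyToSymbol(), and the kernel assumes 128 threads per block):

__constant__ float ConstCoeffs[64]; // constant memory: all grids, application lifetime
__device__ int GlobalCounter;       // global memory: all grids, application lifetime

__global__ void exampleKernel(float* d_in)
{
    __shared__ float partialSum[128]; // shared memory: one version per block, kernel lifetime
    int tx = threadIdx.x;             // automatic scalar: one register copy per thread
    partialSum[tx] = d_in[blockIdx.x*blockDim.x + tx] * ConstCoeffs[tx % 64];
    __syncthreads();
    if (tx == 0) atomicAdd(&GlobalCounter, 1); // visible to all blocks and later kernels
}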
In CUDA, pointers are used to point to data objects in global memory. There are two typical ways in which pointer usage arises in kernel and device functions. First, if an object is allocated by a host function, the pointer to the object is initialized by cudaMalloc() and can be passed to the kernel function as a parameter. For example, the parameters d_M, d_N, and d_P in Figure 5.1 are such pointers. The second type of usage is to assign the address of a variable declared in the global memory to a pointer variable. For example, the statement {float* ptr = &GlobalVar;} in a kernel function assigns the address of GlobalVar into an automatic pointer variable ptr. Readers should refer to the CUDA Programming Guide for using pointers in other memory types.
We have an intrinsic trade-off in the use of device memories in CUDA: global memory is large but slow, whereas the shared memory is small but fast. A common strategy is to partition the data into subsets called tiles so that each tile fits into the shared memory. The term tile draws on the analogy that a large wall (i.e., the global memory data) can be covered by tiles (i.e., subsets that each can fit into the shared memory). An important criterion is that the kernel computation on these tiles can be done independently of each other. Note that not all data structures can be partitioned into tiles given an arbitrary kernel function.
The concept of tiling can be illustrated with the matrix multiplication example. Figure 5.5 shows a small example of matrix multiplication. It corresponds to the kernel function in Figure 5.1. For brevity, we abbreviate d_P[y*Width+x], d_M[y*Width+x], and d_N[y*Width+x] into Py,x, My,x, and Ny,x, respectively. This example assumes that we use four 2×2 blocks to compute the P matrix. Figure 5.5 highlights the computation done by the four threads of block(0,0). These four threads compute P0,0, P0,1, P1,0, and P1,1. The accesses to the M and N elements by thread(0,0) and thread(0,1) of block(0,0) are highlighted with black arrows. For example, thread(0,0) reads M0,0 and N0,0, followed by M0,1 and N1,0, followed by M0,2 and N2,0, followed by M0,3 and N3,0.
Figure 5.5 A small example of matrix multiplication. For brevity, we show d_M[y*Width+x], d_N[y*Width+x], d_P[y*Width+x] as My,x, Ny,x, Py,x, respectively.
Figure 5.6 shows the global memory accesses done by all threads in block0,0. The threads are listed in the vertical direction, with time of access increasing to the right in the horizontal direction. Note that each thread accesses four elements of M and four elements of N during its execution. Among the four threads highlighted, there is a significant overlap in terms of the M and N elements they access. For example, thread0,0 and thread0,1 both access M0,0 as well as the rest of row 0 of M. Similarly, thread0,1 and thread1,1 both access N0,1 as well as the rest of column 1 of N.
Figure 5.6 Global memory accesses performed by threads in block0,0.
The kernel in Figure 5.1 is written so that both thread0,0 and thread0,1 access row 0 elements of M from the global memory. If we can somehow manage to have thread0,0 and thread0,1 collaborate so that these M elements are only loaded from global memory once, we can reduce the total number of accesses to the global memory by half. In general, we can see that every M and N element is accessed exactly twice during the execution of block0,0. Therefore, if we can have all four threads collaborate in their accesses to global memory, we can reduce the traffic to the global memory by half.
Readers should verify that the potential reduction in global memory traffic in the matrix multiplication example is proportional to the dimension of the blocks used. With N×N blocks, the potential reduction of global memory traffic would be N. That is, if we use 16×16 blocks, one can potentially reduce the global memory traffic to 1/16 through collaboration between threads.
Traffic congestion obviously does not only arise in computing. Most of us have experienced traffic congestion in highway systems, as illustrated in Figure 5.7. The root cause of highway traffic congestion is that there are too many cars all squeezing through a road that is designed for a much smaller number of vehicles. When congestion occurs, the travel time for each vehicle is greatly increased. Commute time to work can easily double or triple during traffic congestion.
Figure 5.7 Reducing traffic congestion in highway systems.
All proposed solutions for reduced traffic congestion involve reduction of cars on the road. Assuming that the number of commuters is constant, people need to share rides to reduce the number of cars on the road. A common way to share rides in the United States is carpools, where a group of commuters take turns to drive the group to work in one vehicle. In some countries, the government simply disallows certain classes of cars to be on the road on a daily basis. For example, cars with odd license plates may not be allowed on the road on Monday, Wednesday, or Friday. This encourages people whose cars are allowed on different days to form a carpool group. There are also countries where the government makes gasoline so expensive that people form carpools to save money. In other countries, the government may provide incentives for behavior that reduces the number of cars on the road. In the United States, some lanes of congested highways are designated as carpool lanes—only cars with more than two or three people are allowed to use these lanes. All these measures for encouraging carpooling are designed to overcome the fact that carpooling requires extra effort, as we show in Figure 5.8.
Figure 5.8 Carpooling requires synchronization among people.
The top half of Figure 5.8 shows a good schedule pattern for carpooling. Time goes from left to right. Worker A and worker B have similar schedules for sleep, work, and dinner. This allows these two workers to easily go to work and return home in one car. Their similar schedules allow them to more easily agree on a common departure time and return time. This is, however, not the case for the schedules shown in the bottom half of Figure 5.8. Worker A and worker B have very different habits in this case. Worker A parties until sunrise, sleeps during the day, and goes to work in the evening. Worker B sleeps at night, goes to work in the morning, and returns home for dinner at 6 p.m. The schedules are so wildly different that there is no way these two workers can coordinate a common time to drive to work and return home in one car. For these workers to form a carpool, they need to negotiate a common schedule similar to what is shown in the top half of Figure 5.8.
Tiled algorithms are very similar to carpooling arrangements. We can think of data values accessed by each thread as commuters and DRAM access requests as vehicles. When the rate of DRAM requests exceeds the provisioned bandwidth of the DRAM system, traffic congestion arises and the arithmetic units become idle. If multiple threads access data from the same DRAM location, they can form a “carpool” and combine their accesses into one DRAM request. This, however, requires the threads to have a similar execution schedule so that their data accesses can be combined into one. This is shown in Figure 5.9, where the top portion shows two threads that access the same data elements with similar timing. The bottom half shows two threads that access their common data at very different times. The reason why the bottom half is a bad arrangement is that data elements brought back from the DRAM need to be kept in the on-chip memory for a long time, waiting for thread 2 to consume them. This will likely require a large number of data elements to be kept around, and thus a large on-chip memory. As we will show in the next section, we will use barrier synchronization to keep the threads that form the “carpool” group following approximately the same execution timing.
Figure 5.9 Tiled algorithms require synchronization among threads.
We now present an algorithm where threads collaborate to reduce the traffic to the global memory. The basic idea is to have the threads collaboratively load M and N elements into the shared memory before they individually use these elements in their dot product calculation. Keep in mind that the size of the shared memory is quite small and one must be careful not to exceed the capacity of the shared memory when loading these M and N elements into the shared memory. This can be accomplished by dividing the M and N matrices into smaller tiles. The size of these tiles is chosen so that they can fit into the shared memory. In the simplest form, the tile dimensions equal those of the block, as illustrated in Figure 5.10.
Figure 5.10 Tiling M and N matrices to utilize shared memory.
In Figure 5.10, we divide the M and N matrices into 2×2 tiles, as delineated by the thick lines. The dot product calculations performed by each thread are now divided into phases. In each phase, all threads in a block collaborate to load a tile of M elements and a tile of N elements into the shared memory. This is done by having every thread in a block load one M element and one N element into the shared memory, as illustrated in Figure 5.11. Each row of Figure 5.11 shows the execution activities of a thread. Note that time progresses from left to right. We only need to show the activities of threads in block0,0; the other blocks all have the same behavior. The shared memory array for the M elements is called Mds. The shared memory array for the N elements is called Nds. At the beginning of phase 1, the four threads of block0,0 collaboratively load a tile of M elements into shared memory: thread0,0 loads M0,0 into Mds0,0, thread0,1 loads M0,1 into Mds0,1, thread1,0 loads M1,0 into Mds1,0, and thread1,1 loads M1,1 into Mds1,1. See the second column of Figure 5.11. A tile of N elements is also loaded in a similar manner, shown in the third column of Figure 5.11.
Figure 5.11 Execution phases of a tiled matrix multiplication.
After the two tiles of M and N elements are loaded into the shared memory, these values are used in the calculation of the dot product. Note that each value in the shared memory is used twice. For example, the M1,1 value, loaded by thread1,1 into Mds1,1, is used twice, once by thread0,1 and once by thread1,1. By loading each global memory value into shared memory so that it can be used multiple times, we reduce the number of accesses to the global memory. In this case, we reduce the number of accesses to the global memory by half. Readers should verify that the reduction is by a factor of N if the tiles are N×N elements.
Note that the calculation of each dot product in Figure 5.6 is now performed in two phases, shown as phase 1 and phase 2 in Figure 5.11. In each phase, products of two pairs of the input matrix elements are accumulated into the Pvalue variable. Note that Pvalue is an automatic variable so a private version is generated for each thread. We added subscripts just to clarify that these are different instances of the Pvalue variable created for each thread. The first phase calculation is shown in the fourth column of Figure 5.11; the second phase in the seventh column. In general, if an input matrix is of dimension N and the tile size is TILE_WIDTH, the dot product would be performed in N/TILE_WIDTH phases. The creation of these phases is key to the reduction of accesses to the global memory. With each phase focusing on a small subset of the input matrix values, the threads can collaboratively load the subset into the shared memory and use the values in the shared memory to satisfy their overlapping input needs in the phase.
Note also that Mds and Nds are reused to hold the input values. In each phase, the same Mds and Nds are used to hold the subset of M and N elements used in the phase. This allows a much smaller shared memory to serve most of the accesses to global memory. This is due to the fact that each phase focuses on a small subset of the input matrix elements. Such focused access behavior is called locality. When an algorithm exhibits locality, there is an opportunity to use small, high-speed memories to serve most of the accesses and remove these accesses from the global memory. Locality is as important for achieving high performance in multicore CPUs as in many-thread GPUs. We return to the concept of locality in Chapter 6.
We are now ready to present the tiled kernel function that uses shared memory to reduce the traffic to global memory. The kernel shown in Figure 5.12 implements the phases illustrated in Figure 5.11. In Figure 5.12, lines 1 and 2 declare Mds and Nds as shared memory variables. Recall that the scope of shared memory variables is a block. Thus, one pair of Mds and Nds will be created for each block and all threads of a block have access to the same Mds and Nds. This is important since all threads in a block must have access to the M and N values loaded into Mds and Nds by their peers so that they can use these values to satisfy their input needs.
Figure 5.12 Tiled matrix multiplication kernel using shared memory.
#define TILE_WIDTH 16
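Since the listing for Figure 5.12 is not reproduced here, the following sketch matches the line numbers referenced in the discussion below; it assumes Width is a multiple of TILE_WIDTH:

__global__ void matrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
/* 1 */  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
/* 2 */  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
/* 3 */  int bx = blockIdx.x;  int by = blockIdx.y;
/* 4 */  int tx = threadIdx.x; int ty = threadIdx.y;
         // Identify the row and column of the d_P element to work on.
/* 5 */  int Row = by * TILE_WIDTH + ty;
/* 6 */  int Col = bx * TILE_WIDTH + tx;
/* 7 */  float Pvalue = 0;
         // Loop over the d_M and d_N tiles required to compute the d_P element.
/* 8 */  for (int m = 0; m < Width/TILE_WIDTH; ++m) {
            // Collaborative loading of d_M and d_N tiles into shared memory.
/* 9 */     Mds[ty][tx] = d_M[Row*Width + m*TILE_WIDTH + tx];
/* 10 */    Nds[ty][tx] = d_N[(m*TILE_WIDTH + ty)*Width + Col];
/* 11 */    __syncthreads();
/* 12 */    for (int k = 0; k < TILE_WIDTH; ++k)
/* 13 */       Pvalue += Mds[ty][k] * Nds[k][tx];
/* 14 */    __syncthreads();
         }
/* 15 */ d_P[Row*Width + Col] = Pvalue;
}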
Lines 3 and 4 save the threadIdx and blockIdx values into automatic variables and thus into registers for fast access. Recall that automatic scalar variables are placed into registers. Their scope is in each individual thread. That is, one private version of tx, ty, bx, and by is created by the runtime system for each thread. They will reside in registers that are accessible by one thread. They are initialized with the threadIdx and blockIdx values and used many times during the lifetime of the thread. Once the thread ends, the values of these variables also cease to exist.
Lines 5 and 6 determine the row index and column index of the d_P element that the thread is to produce. As shown in line 6, the horizontal (x) position, or the column index of the d_P element to be produced by a thread, can be calculated as bx*TILE_WIDTH+tx. This is because each block covers TILE_WIDTH elements in the horizontal dimension. A thread in block bx would have bx blocks of threads, or (bx*TILE_WIDTH) threads, before it; they cover bx*TILE_WIDTH elements of d_P. Another tx threads within the same block would cover another tx elements of d_P. Thus, the thread with bx and tx should be responsible for calculating the d_P element of which the x index is bx*TILE_WIDTH+tx. This horizontal index is saved in the variable Col (for column) for the thread and is also illustrated in Figure 5.13. For the example in Figure 5.10, the x index of the d_P element to be calculated by thread0,1 of block1,0 is 0×2+1=1. Similarly, the y index can be calculated as by*TILE_WIDTH+ty. This vertical index is saved in the variable Row for the thread. Thus, as shown in Figure 5.10, each thread calculates the d_P element at the Col column and the Row row. Going back to the example in Figure 5.10, the y index of the d_P element to be calculated by thread1,0 of block0,1 is 1×2+0=2. Thus, the d_P element to be calculated by this thread is d_P2,1.
Figure 5.13 Calculation of the matrix indices in tiled multiplication.
Line 8 of Figure 5.12 marks the beginning of the loop that iterates through all the phases of calculating the final d_P element. Each iteration of the loop corresponds to one phase of the calculation shown in Figure 5.11. The m variable indicates the number of phases that have already been done for the dot product. Recall that each phase uses one tile of d_M and one tile of d_N elements. Therefore, at the beginning of each phase, m*TILE_WIDTH pairs of d_M and d_N elements have been processed by previous phases.
In each phase, line 9 loads the appropriate d_M element into the shared memory. Since we already know the row of d_M and column of d_N to be processed by the thread, we will focus on the column index of d_M and row index of d_N. As shown in Figure 5.11, each block has TILE_WIDTH² threads that will collaborate to load TILE_WIDTH² d_M elements into the shared memory. Thus, all we need to do is to assign each thread to load one d_M element. This is conveniently done using blockIdx and threadIdx. Note that the beginning column index of the section of d_M elements to be loaded is m*TILE_WIDTH. Therefore, an easy approach is to have every thread load an element at an offset of tx, the threadIdx.x value, from this beginning index. This is precisely what we have in line 9, where each thread loads d_M[Row*Width + m*TILE_WIDTH + tx]. Since the value of Row is a linear function of ty, each of the TILE_WIDTH² threads will load a unique d_M element into the shared memory. Together, these threads will load the dark square subset of d_M in Figure 5.13. Readers should use the small example in Figures 5.5 and 5.6 to verify that the address calculation works correctly.
The barrier __syncthreads() in line 11 ensures that all threads have finished loading the tiles of d_M and d_N into Mds and Nds before any of them can move forward. The loop in line 12 then performs one phase of the dot product based on these tile elements. The progression of the loop for thread(ty,tx) is shown in Figure 5.13, with the direction of d_M and d_N elements usage along the arrow marked with k, the loop variable in line 12. Note that these elements will be accessed from Mds and Nds, the shared memory arrays holding these d_M and d_N elements. The barrier __syncthreads() in line 14 ensures that all threads have finished using the d_M and d_N elements in the shared memory before any of them move on to the next iteration and load the elements in the next tiles. This way, none of the threads would load the elements too early and corrupt the input values for other threads.
After all sections of the dot product are complete, the execution exits the loop of line 8. All threads then write to their d_P element using Row and Col.
The benefit of the tiled algorithm is substantial. For matrix multiplication, the global memory accesses are reduced by a factor of TILE_WIDTH. If one uses 16×16 tiles, we can reduce the global memory accesses by a factor of 16. This increases the CGMA from 1 to 16. This improvement allows the memory bandwidth of a CUDA device to support a computation rate close to its peak performance. For example, this improvement allows a 150 GB/s global memory bandwidth to support (150/4)×16=600 GFLOPS!
While CUDA registers and shared memory can be extremely effective in reducing the number of accesses to global memory, one must be careful not to exceed the capacity of these memories. These memories are forms of resources needed for thread execution. Each CUDA device offers a limited amount of these resources, which limits the number of threads that can simultaneously reside in an SM for a given application. In general, the more resources each thread requires, the fewer threads can reside in each SM, and thus the fewer threads can reside in the entire device.
Let’s use an example to illustrate the interaction between register usage of a kernel and the level of parallelism that a device can support. Assume that in a device D, each SM can accommodate up to 1,536 threads and has 16,384 registers. While 16,384 is a large number, it only allows each thread to use a very limited number of registers considering the number of threads that can reside in each SM. To support 1,536 threads, each thread can use only 16,384÷1,536=10 registers. If each thread uses 11 registers, the number of threads able to be executed concurrently in each SM will be reduced. Such reduction is done at the block granularity. For example, if each block contains 512 threads, the reduction of threads will be done by reducing 512 threads at a time. Thus, the next lower number of threads from 1,536 would be 512, a one-third reduction of threads that can simultaneously reside in each SM. This can greatly reduce the number of warps available for scheduling, thus reducing the processor’s ability to find useful work in the presence of long-latency operations.
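If a kernel naturally requires more registers than such a budget allows, nvcc’s -maxrregcount option (a standard compiler flag) can cap the per-thread register count, possibly at the cost of register spills; the file name below is hypothetical:

nvcc -maxrregcount=10 -o matrixmul matrixmul.cu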
Note that the number of registers available to each SM varies from device to device. An application can dynamically determine the number of registers available in each SM of the device used and choose a version of the kernel that uses the number of registers appropriate for the device. This can be done by calling the cudaGetDeviceProperties() function, the use of which was discussed in Section 4.6. Assume that variable &dev_prop is passed to the function for the device property, and the field dev_prop.regsPerBlock gives the number of registers available in each SM. For device D, the returned value for this field should be 16,384. The application can then divide this number by the target number of threads to reside in each SM to determine the number of registers that can be used in the kernel.
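As a minimal sketch of this query (dev_prop.regsPerBlock is the real cudaDeviceProp field named above; the target of 1,536 resident threads per SM is an assumption carried over from the device D example):

cudaDeviceProp dev_prop;
cudaGetDeviceProperties(&dev_prop, 0);  // query the properties of device 0

int target_threads = 1536;              // desired number of resident threads per SM
int regs_per_thread = dev_prop.regsPerBlock / target_threads;
// For device D, this evaluates to 16384/1536 = 10 registers per thread.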
Shared memory usage can also limit the number of threads assigned to each SM. Assume device D has 16,384 (16 K) bytes of shared memory in each SM. Keep in mind that shared memory is used by blocks. Assume that each SM can accommodate up to eight blocks. To reach this maximum, each block must not use more than 2 K bytes of shared memory. If each block uses more than 2 K bytes of memory, the number of blocks that can reside in each SM is such that the total amount of shared memory used by these blocks does not exceed 16 K bytes. For example, if each block uses 5 K bytes of shared memory, no more than three blocks can be assigned to each SM.
For the matrix multiplication example, shared memory can become a limiting factor. For a tile size of 16×16, each block needs 16×16×4=1 K bytes of storage for Mds. Another 1 KB is needed for Nds. Thus, each block uses 2 K bytes of shared memory. The 16 K–byte shared memory allows eight blocks to simultaneously reside in an SM. Since this is the same as the maximum allowed by the threading hardware, shared memory is not a limiting factor for this tile size. In this case, the real limitation is the threading hardware limitation that only 768 threads are allowed in each SM. This limits the number of blocks in each SM to three. As a result, only 3×2 KB=6 KB of the shared memory will be used. These limits do change from one device generation to the next but are properties that can be determined at runtime; for example, the GT200 series of processors can support up to 1,024 threads in each SM.
Note that the size of shared memory in each SM can also vary from device to device. Each generation or model of device can have a different amount of shared memory in each SM. It is often desirable for a kernel to be able to use a different amount of shared memory according to the amount available in the hardware. That is, we may want to have a kernel dynamically determine the size of the shared memory and adjust the amount of shared memory used. This can be done by calling the cudaGetDeviceProperties() function, the general use of which was discussed in Section 4.6. Assume that variable &dev_prop is passed to the function; the field dev_prop.sharedMemPerBlock then gives the amount of shared memory available in each SM. The programmer can then determine the amount of shared memory that should be used by each block.
Unfortunately, the kernel in Figure 5.12 does not support this. The declarations used in Figure 5.12 hardwire the size of its shared memory usage to a compile-time constant:
__shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
__shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
That is, the size of Mds and Nds is set to be TILE_WIDTH² elements, whatever the value of TILE_WIDTH is set to be at compile time. For example, assume that the file contains
#define TILE_WIDTH 16
Both Mds and Nds will have 256 elements. If we want to change the size of Mds and Nds, we need to change the value of TILE_WIDTH and recompile. The kernel cannot easily adjust its shared memory usage at runtime without recompilation. We can enable such adjustment with a different style of declaration in CUDA. We can add a C extern keyword in front of the shared memory declaration and omit the size of the array in the declaration. Based on this style, the declarations for Mds and Nds become:
extern __shared__ float Mds[];
extern __shared__ float Nds[];
Note that the arrays are now one dimensional. We will need to use a linearized index based on the vertical and horizontal indices.
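One caveat: CUDA supports only a single dynamically sized extern __shared__ array per kernel, so the two declarations above actually alias the same allocation. A common pattern (a sketch, not the book’s listing) is to declare one array and carve it up with offsets:

extern __shared__ float MdsNds[];             // single dynamic allocation for both tiles
float* Mds = MdsNds;                          // first TILE_WIDTH*TILE_WIDTH floats
float* Nds = MdsNds + TILE_WIDTH*TILE_WIDTH;  // Nds begins where Mds ends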
At runtime, when we launch the kernel, we can dynamically determine the amount of shared memory to be used according to the device query result and supply that as a third configuration parameter to the kernel launch. For example, the kernel launch statement in Figure 4.18 could be replaced with the following statements:
size_t size =
    calculate_appropriate_SM_usage(dev_prop.sharedMemPerBlock,…);
matrixMulKernel<<<dimGrid, dimBlock, size>>>(Md, Nd, Pd, Width);
where size_t is a built-in type for declaring a variable to hold the size information for dynamically allocated data structures. We have omitted the details of the calculation for setting the value of size at runtime.
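One possible sketch of that omitted calculation, assuming the kernel uses one Mds tile and one Nds tile of tile_width×tile_width floats (the function name mirrors the hypothetical one in the statements above):

size_t calculate_appropriate_SM_usage(size_t sharedMemPerBlock, int tile_width)
{
    // Two tiles of tile_width x tile_width single-precision elements per block.
    size_t size = 2 * tile_width * tile_width * sizeof(float);
    // Return 0 to signal that the caller should choose a smaller tile_width.
    return (size <= sharedMemPerBlock) ? size : 0;
}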
In summary, CUDA defines registers, shared memory, and constant memory that can be accessed at a higher speed and in a more parallel manner than the global memory. Using these memories effectively will likely require redesign of the algorithm. We use matrix multiplication as an example to illustrate tiled algorithms, a popular strategy to enhance locality of data access and enable effective use of shared memory. We demonstrate that with 16×16 tiling, global memory accesses are no longer the major limiting factor for matrix multiplication performance.
It is, however, important for CUDA programmers to be aware of the limited sizes of these types of memory. Their capacities are implementation dependent. Once their capacities are exceeded, they become limiting factors for the number of threads that can execute simultaneously in each SM. The ability to reason about hardware limitations when developing an application is a key aspect of computational thinking. Readers are also referred to Appendix B for a summary of resource limitations of several different devices.
Although we introduced the tiled algorithm in the context of CUDA programming, it is an effective strategy for achieving high performance in virtually all types of parallel computing systems. The reason is that an application must exhibit locality in data access to make effective use of high-speed memories in these systems. For example, in a multicore CPU system, data locality allows an application to effectively use on-chip data caches to reduce memory access latency and achieve high performance. Readers will therefore find the tiled algorithm useful when developing parallel applications for other types of parallel computing systems using other programming models.
Our goal for this chapter is to introduce the different types of CUDA memory. We introduced the tiled algorithm as an effective strategy for using shared memory. We have not discussed the use of constant memory, which will be explained in Chapter 8.
5.1. Consider the matrix addition in Exercise 3.1. Can one use shared memory to reduce the global memory bandwidth consumption? Hint: analyze the elements accessed by each thread and see if there is any commonality between threads.
5.2. Draw the equivalent of Figure 5.6 for an 8×8 matrix multiplication with 2×2 tiling and 4×4 tiling. Verify that the reduction in global memory bandwidth is indeed proportional to the dimension size of the tiles.
5.3. What type of incorrect execution behavior can happen if one forgets to use __syncthreads() in the kernel of Figure 5.12?
5.4. Assuming that capacity is not an issue for registers or shared memory, give one case in which it would be valuable to use shared memory instead of registers to hold values fetched from global memory. Explain your answer.
5.5. For our tiled matrix–matrix multiplication kernel, if we use a 32×32 tile, what is the reduction of memory bandwidth usage for input matrices M and N?
5.6. Assume that a kernel is launched with 1,000 thread blocks each of which has 512 threads. If a variable is declared as a local variable in the kernel, how many versions of the variable will be created through the lifetime of the execution of the kernel?
5.7. In the previous question, if a variable is declared as a shared memory variable, how many versions of the variable will be created through the lifetime of the execution of the kernel?
5.8. Explain the difference between shared memory and L1 cache.
5.9. Consider performing a matrix multiplication of two input matrices with dimensions N×N. How many times is each element in the input matrices requested from global memory when:
a. There is no tiling?
b. Tiles of size T×T are used?
5.10. A kernel performs 36 floating-point operations and 7 32-bit word global memory accesses per thread. For each of the following device properties, indicate whether this kernel is compute- or memory-bound.
a. Peak FLOPS=200 GFLOPS, peak memory bandwidth=100 GB/s.
b. Peak FLOPS=300 GFLOPS, peak memory bandwidth=250 GB/s.
5.11. Indicate which of the following assignments per streaming multiprocessor is possible. In the case where it is not possible, indicate the limiting factor(s).
a. 4 blocks with 128 threads each and 32 B shared memory per thread on a device with compute capability 1.0.
b. 8 blocks with 128 threads each and 16 B shared memory per thread on a device with compute capability 1.0.
c. 16 blocks with 32 threads each and 64 B shared memory per thread on a device with compute capability 1.0.
d. 2 blocks with 512 threads each and 32 B shared memory per thread on a device with compute capability 1.2.
e. 4 blocks with 256 threads each and 16 B shared memory per thread on a device with compute capability 1.2.
f. 8 blocks with 256 threads each and 8 B shared memory per thread on a device with compute capability 1.2.
1See the CUDA Programming Guide for zero-copy access to the global memory.
2There are some exceptions to this rule. The compiler may decide to store an automatic array into registers if all accesses are done with constant index values.
3Note that one can use CUDA memory fencing to ensure data coherence between thread blocks if the number of thread blocks is smaller than the number of SMs in the CUDA device. See the CUDA Programming Guide for more details.
6.1 Warps and Thread Execution
6.2 Global Memory Bandwidth
6.3 Dynamic Partitioning of Execution Resources
6.4 Instruction Mix and Thread Granularity
6.5 Summary
6.6 Exercises
The execution speed of a CUDA kernel can vary greatly depending on the resource constraints of the device being used. In this chapter, we will discuss the major types of resource constraints in a CUDA device and how they can affect kernel execution performance on that device. To achieve his or her goals, a programmer often has to find ways to achieve a level of performance higher than that of the initial version of the application. In different applications, different constraints may dominate and become the limiting factors. One can improve the performance of an application on a particular CUDA device, sometimes dramatically, by trading one resource usage for another. This strategy works well if the resource constraint alleviated was actually the dominating constraint before the strategy was applied, and the one exacerbated does not have negative effects on parallel execution. Without such understanding, performance tuning would be guesswork; plausible strategies may or may not lead to performance enhancements. Beyond insights into these resource constraints, this chapter further offers principles and case studies designed to cultivate intuition about the types of algorithm patterns that can result in high-performance execution. It also establishes idioms and ideas that will likely lead to good performance improvements during your performance tuning efforts.
Let’s first discuss some aspects of thread execution that can limit performance. Recall that launching a CUDA kernel generates a grid of threads that are organized as a two-level hierarchy. At the top level, a grid consists of a 1D, 2D, or 3D array of blocks. At the bottom level, each block, in turn, consists of a 1D, 2D, or 3D array of threads. In Chapter 4, we saw that blocks can execute in any order relative to each other, which allows for transparent scalability in parallel execution of CUDA kernels. However, we did not say much about the execution timing of threads within each block.
Warps and SIMD Hardware
The motivation for executing threads as warps is illustrated in the following diagram (same as Figure 5.4). The processor has only one control unit that fetches and decodes instructions. The same control signal goes to multiple processing units, each of which executes one of the threads in a warp. Since all processing units are controlled by the same instruction, their execution differences are due to the different data operand values in the register files. This is called single instruction, multiple data (SIMD) in processor design. For example, although all processing units are controlled by an instruction
add r1, r2, r3
the r2 and r3 values are different in different processing units.
Control units in modern processors are quite complex, including sophisticated logic for fetching instructions and access ports to the instruction memory. They include on-chip instruction caches to reduce the latency of instruction fetch. Having multiple processing units share a control unit can result in significant reduction in hardware manufacturing cost and power consumption.
As the processors are increasingly power-limited, new processors will likely use SIMD designs. In fact, we may see even more processing units sharing a control unit in the future.
Conceptually, one should assume that threads in a block can execute in any order with respect to each other. Barrier synchronizations should be used whenever we want to ensure all threads have completed a common phase of their execution before any of them start the next phase. The correctness of executing a kernel should not depend on the fact that certain threads will execute in synchrony with each other. Having said this, we also want to point out that due to various hardware cost considerations, current CUDA devices actually bundle multiple threads for execution. Such an implementation strategy leads to performance limitations for certain types of kernel function code constructs. It is advantageous for application developers to change these types of constructs to other equivalent forms that perform better.
As we discussed in Chapter 4, current CUDA devices bundle several threads for execution. Each thread block is partitioned into warps. The execution of warps is implemented by SIMD hardware (see the “Warps and SIMD Hardware” sidebar). This implementation technique helps to reduce hardware manufacturing cost, lower runtime operation electricity cost, and enable some optimizations in servicing memory accesses. In the foreseeable future, we expect that warp partitioning will remain a popular implementation technique. However, the size of a warp can easily vary from implementation to implementation. Up to this point in time, all CUDA devices have used similar warp configurations in which each warp consists of 32 threads.
Thread blocks are partitioned into warps based on thread indices. If a thread block is organized into a 1D array (i.e., only threadIdx.x is used), the partition is straightforward; threadIdx.x values within a warp are consecutive and increasing. For a warp size of 32, warp 0 starts with thread 0 and ends with thread 31, warp 1 starts with thread 32 and ends with thread 63. In general, warp n starts with thread 32×n and ends with thread 32(n+1)−1. For a block of which the size is not a multiple of 32, the last warp will be padded with extra threads to fill up the 32 threads. For example, if a block has 48 threads, it will be partitioned into two warps, and its warp 1 will be padded with 16 extra threads.
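As a quick illustration (ours, not code from the text), the warp a thread belongs to and its position within that warp follow directly from this partitioning for a 1D block; the constant 32 is the warp size of devices to date:

int warp_id = threadIdx.x / 32; // which warp within the block
int lane = threadIdx.x % 32; // position of the thread within its warp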
For blocks that consist of multiple dimensions of threads, the dimensions will be projected into a linear order before partitioning into warps. The linear order is determined by placing the rows with larger y and z coordinates after those with lower ones. That is, if a block consists of two dimensions of threads, one would form the linear order by placing all threads of which threadIdx.y is 1 after those of which threadIdx.y is 0, threads of which threadIdx.y is 2 after those of which threadIdx.y is 1, and so on.
Figure 6.1 shows an example of placing threads of a 2D block into linear order. The upper part shows the 2D view of the block. Readers should recognize the similarity with the row-major layout of 2D arrays in C, as shown in Figure 4.3. Each thread is shown as Ty,x, x being threadIdx.x and y being threadIdx.y. The lower part of Figure 6.1 shows the linear view of the block. The first four threads are those of which the threadIdx.y value is 0; they are ordered with increasing threadIdx.x values. The next four threads are those of which the threadIdx.y value is 1; they are also placed with increasing threadIdx.x values. For this example, all 16 threads form half a warp. The warp will be padded with another 16 threads to complete a 32-thread warp. Imagine a 2D block with 8×8 threads. The 64 threads will form two warps. The first warp starts from T0,0 and ends with T3,7. The second warp starts with T4,0 and ends with T7,7. It would be a useful exercise to draw out the picture.
Figure 6.1 Placing 2D threads into linear order.
For a 3D block, we first place all threads of which the threadIdx.z value is 0 into the linear order. Among these threads, they are treated as a 2D block as shown in Figure 6.1. All threads of which the threadIdx.z value is 1 will then be placed into the linear order, and so on. For a 3D thread block of dimensions 2×8×4 (four in the x dimension, eight in the y dimension, and two in the z dimension), the 64 threads will be partitioned into two warps, with T0,0,0 through T0,7,3 in the first warp and T1,0,0 through T1,7,3 in the second warp.
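Stated as a single formula (a restatement of this rule, not code from the text), the linear position of thread (threadIdx.z, threadIdx.y, threadIdx.x) used for warp partitioning is:

int linear_tid = threadIdx.z * blockDim.y * blockDim.x
               + threadIdx.y * blockDim.x
               + threadIdx.x; // z-then-y-then-x projection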
The SIMD hardware executes all threads of a warp as a bundle. An instruction is run for all threads in the same warp. It works well when all threads within a warp follow the same execution path, more formally referred to as control flow, when operating on their data. For example, for an if-else construct, the execution works well when either all threads execute the if part or all execute the else part. When threads within a warp take different control flow paths, the SIMD hardware will take multiple passes through these divergent paths: one pass executes those threads that follow the if part and another pass executes those that follow the else part. During each pass, the threads that follow the other path are not allowed to take effect. These passes are sequential to each other, thus they add to the execution time.
The multipass approach to divergent warp execution extends the SIMD hardware’s ability to implement the full semantics of CUDA threads. While the hardware executes the same instruction for all threads in a warp, it selectively lets the threads take effect in each pass only, allowing every thread to take its own control flow path. This preserves the independence of threads while taking advantage of the reduced cost of SIMD hardware.
When threads in the same warp follow different paths of control flow, we say that these threads diverge in their execution. In the if-else example, divergence arises if some threads in a warp take the then path and some the else path. The cost of divergence is the extra pass the hardware needs to take to allow the threads in a warp to make their own decisions. Divergence also can arise in other constructs; for example, if threads in a warp execute a for loop that can iterate six, seven, or eight times for different threads. All threads will finish the first six iterations together. Two passes will be used to execute the seventh iteration, one for those that take the iteration and one for those that do not. Two passes will be used to execute the eighth iteration, one for those that take the iteration and one for those that do not.
In terms of source statements, a control construct can result in thread divergence when its decision condition is based on threadIdx values. For example, the statement if (threadIdx.x > 2) {} causes the threads to follow two divergent control flow paths. Threads 0, 1, and 2 follow a different path than threads 3, 4, 5, etc. Similarly, a loop can cause thread divergence if its loop condition is based on thread index values. Such usages arise naturally in some important parallel algorithms. We will use a reduction algorithm to illustrate this point.
A reduction algorithm derives a single value from an array of values. The single value could be the sum, the maximal value, the minimal value, etc. among all elements. All these types of reductions share the same computation structure. A reduction can be easily done by sequentially going through every element of the array. When an element is visited, the action to take depends on the type of reduction being performed. For a sum reduction, the value of the element being visited at the current step, or the current value, is added to a running sum. For a maximal reduction, the current value is compared to a running maximal value of all the elements visited so far. If the current value is larger than the running maximal, the current element value becomes the running maximal value. For a minimal reduction, the value of the element currently being visited is compared to a running minimal. If the current value is smaller than the running minimal, the current element value becomes the running minimal. The sequential algorithm ends when all the elements are visited. The sequential reduction algorithm is work-efficient in that every element is only visited once and only a minimal amount of work is performed when each element is visited. Its execution time is proportional to the number of elements involved. That is, the computational complexity of the algorithm is O(N), where N is the number of elements involved in the reduction.
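A minimal sketch of this sequential sum reduction (our illustration; x and N stand for the input array and its length) makes the O(N) structure plain:

float sum = 0.0f;
for (int i = 0; i < N; ++i)
    sum += x[i]; // each element is visited exactly once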
The time needed to visit all elements of a large array motivates parallel execution. A parallel reduction algorithm typically resembles the structure of a soccer tournament. In fact, the elimination process of the World Cup is a reduction of “maximal” where the maximal is defined as the team that “beats” all other teams. The tournament “reduction” is done by multiple rounds. The teams are divided into pairs. During the first round, all pairs play in parallel. Winners of the first round advance to the second round, the winners of which advance to the third round, etc. With 16 teams entering a tournament, 8 winners will emerge from the first round, 4 from the second round, 2 from the third round, and 1 final winner from the fourth round. It should be easy to see that even with 1,024 teams, it takes only 10 rounds to determine the final winner. The trick is to have enough soccer fields to hold the 512 games in parallel during the first round, 256 games in the second round, 128 games in the third round, and so on. With enough fields, even with 60,000 teams, we can determine the final winner in just 16 rounds. Of course, one would need to have enough soccer fields and enough officials to accommodate the 30,000 games in the first round, etc.
Figure 6.2 shows a kernel function that performs parallel sum reduction. The original array is in the global memory. Each thread block reduces a section of the array by loading the elements of the section into the shared memory and performing parallel reduction. The code that loads the elements from global memory into the shared memory is omitted from Figure 6.2 for brevity. The reduction is done in place, which means the elements in the shared memory will be replaced by partial sums. Each iteration of the for loop in the kernel function implements a round of reduction. The __syncthreads() statement (line 5) in the loop ensures that all partial sums for the previous iteration have been generated and thus all threads are ready to enter the current iteration before any one of them is allowed to do so. This way, all threads that enter the second iteration will be using the values produced in the first iteration. After the first round, the even elements will be replaced by the partial sums generated in the first round. After the second round, the elements of which the indices are multiples of four will be replaced with the partial sums. After the final round, the total sum of the entire section will be in element 0.
Figure 6.2 A simple sum reduction kernel.
In Figure 6.2, line 3 initializes the stride variable to 1. During the first iteration, the if statement in line 6 is used to select only the even threads to perform addition between two neighboring elements. The execution of the kernel is illustrated in Figure 6.3. The threads and the array element values are shown in the horizontal direction. The iterations taken by the threads are shown in the vertical direction with time progressing from top to bottom. Each row of Figure 6.3 shows the contents of the array elements after an iteration of the for loop.
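Because Figure 6.2 itself is not reproduced here, the following sketch is our reconstruction from the description; partialSum is the shared array holding the block's section and SECTION_SIZE is a placeholder constant:

__shared__ float partialSum[SECTION_SIZE]; // loading from global memory omitted, as in the figure
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads(); // all partial sums of the previous round must be ready
    if (t % (2 * stride) == 0) // stride 1 selects even threads, stride 2 multiples of four, ...
        partialSum[t] += partialSum[t + stride];
}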
Figure 6.3 Execution of the sum reduction kernel.
As shown in Figure 6.3, the even elements of the array hold the pairwise partial sums after iteration 1. Before the second iteration, the value of the stride variable is doubled to 2. During the second iteration, only those threads of which the indices are multiples of four will execute the add statement in line 8. Each thread generates a partial sum that includes four elements, as shown in row 2. With 512 elements in each section, the kernel function will generate the sum of the entire section after nine iterations. By using blockDim.x as the loop bound in line 4, the kernel assumes that it is launched with the same number of threads as the number of elements in the section. That is, for a section size of 512, the kernel needs to be launched with 512 threads.1
Let’s analyze the total amount of work done by the kernel. Assume that the total number of elements to be reduced is N. The first round requires N/2 additions. The second round requires N/4 additions. The final round has only one addition. There are log2(N) rounds. The total number of additions performed by the kernel is N/2+N/4+N/8+…+1=N−1. Therefore, the computational complexity of the reduction algorithm is O(N). The algorithm is work-efficient. However, we also need to make sure that the hardware is efficiently utilized while executing the kernel.
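Written out, the count of additions forms a geometric series:

\[ \frac{N}{2} + \frac{N}{4} + \frac{N}{8} + \cdots + 1 \;=\; \sum_{k=1}^{\log_2 N} \frac{N}{2^k} \;=\; N - 1 \]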
The kernel in Figure 6.2 clearly has thread divergence. During the first iteration of the loop, only those threads of which the threadIdx.x values are even will execute the add statement. One pass will be needed to execute these threads and one additional pass will be needed to execute those that do not execute line 8. In each successive iteration, fewer threads will execute line 8, but two passes will still be needed to execute all the threads during each iteration. This divergence can be reduced with a slight change to the algorithm.
Figure 6.4 shows a modified kernel with a slightly different algorithm for sum reduction. Instead of adding neighboring elements in the first round, it adds elements that are half a section away from each other. It does so by initializing the stride to be half the size of the section. All pairs added during the first round are half the section size away from each other. After the first iteration, all the pairwise sums are stored in the first half of the array. The loop divides the stride by 2 before entering the next iteration. Thus, for the second iteration, the stride variable value is one-quarter of the section size; that is, the threads add elements that are a quarter of a section away from each other during the second iteration.
Figure 6.4 A kernel with less thread divergence.
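Reconstructing again from the description (our sketch, not the figure's exact code), the revised loop has the following shape:

for (unsigned int stride = blockDim.x / 2; stride >= 1; stride /= 2) {
    __syncthreads();
    if (t < stride) // a contiguous group of threads performs the additions
        partialSum[t] += partialSum[t + stride];
}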
Note that the kernel in Figure 6.4 still has an if statement (line 6) in the loop. The number of threads that execute line 7 in each iteration is the same as in Figure 6.2. So, why should there be a performance difference between the two kernels? The answer lies in the positions of threads that execute line 7 relative to those that do not.
Figure 6.5 illustrates the execution of the revised kernel. During the first iteration, all threads of which the threadIdx.x values are less than half of the size of the section execute line 7. For a section of 512 elements, threads 0–255 execute the add statement during the first iteration while threads 256–511 do not. The pairwise sums are stored in elements 0–255 after the first iteration. Since the warps consist of 32 threads with consecutive threadIdx.x values, all threads in warps 0–7 execute the add statement, whereas warps 8–15 all skip the add statement. Since all threads in each warp take the same path, there is no thread divergence!
Figure 6.5 Execution of the revised algorithm.
The kernel in Figure 6.4 does not completely eliminate the divergence caused by the if statement. Readers should verify that, starting with the fifth iteration, the number of threads that execute line 7 falls below 32. That is, the final five iterations will have only 16, 8, 4, 2, and 1 thread(s) performing the addition. This means that the kernel execution will still have divergence in these iterations. However, the number of loop iterations that have divergence is reduced from nine to five.
One of the most important factors of CUDA kernel performance is accessing data in the global memory. CUDA applications exploit massive data parallelism. Naturally, CUDA applications tend to process a massive amount of data from the global memory within a short period of time. In Chapter 5, we discussed tiling techniques that utilize shared memories to reduce the total amount of data that must be accessed by a collection of threads in the thread block. In this chapter, we will further discuss memory coalescing techniques that can more effectively move data from the global memory into shared memories and registers. Memory coalescing techniques are often used in conjunction with tiling techniques to allow CUDA devices to reach their performance potential by more efficiently utilizing the global memory bandwidth.2
The global memory of a CUDA device is implemented with DRAMs. Data bits are stored in DRAM cells that are small capacitors, where the presence or absence of a tiny amount of electrical charge distinguishes between 0 and 1. Reading data from a DRAM cell requires the small capacitor to use its tiny electrical charge to drive a highly capacitive line leading to a sensor and set off its detection mechanism that determines whether a sufficient amount of charge is present in the capacitor to qualify as a “1” (see “Why Are DRAMs So Slow?” sidebar). This process takes tens of nanoseconds in modern DRAM chips. Because this is a very slow process relative to the desired data access speed (sub-nanosecond access per byte), modern DRAMs use parallelism to increase their rate of data access.
Each time a DRAM location is accessed, many consecutive locations that include the requested location are actually accessed. Many sensors are provided in each DRAM chip and they work in parallel. Each senses the content of a bit within these consecutive locations. Once detected by the sensors, the data from all these consecutive locations can be transferred at very high speed to the processor. If an application can make focused use of data from consecutive locations, the DRAMs can supply the data at a much higher rate than if a truly random sequence of locations were accessed.
Why Are DRAMs So Slow?
The following figure shows a DRAM cell and the path for accessing its content. The decoder is an electronic circuit that uses a transistor to drive a line connected to the outlet gates of thousands of cells. It can take a long time for the line to be fully charged or discharged to the desired level.
A more formidable challenge is for the cell to drive the line to the sense amplifiers and allow the sense amplifier to detect its content. This is based on electrical charge sharing. The gate lets out the tiny amount of electrical charge stored in the cell. If the cell content is “1,” the tiny amount of charge must raise the potential of the large capacitance formed by the long bit line and the input of the sense amplifier. A good analogy would be for someone to hold a small cup of coffee at one end of a long hallway for another person to smell the aroma propagated through the hallway to determine the flavor of the coffee.
One could speed up the process by using a larger, stronger capacitor in each cell. However, DRAMs have been going in the opposite direction. The capacitors in each cell have been steadily reduced in size over time so that more bits can be stored in each chip. This is why the access latency of DRAMs has not decreased over time.
Recognizing the organization of modern DRAMs, current CUDA devices employ a technique that allows the programmers to achieve high global memory access efficiency by organizing memory accesses of threads into favorable patterns. This technique takes advantage of the fact that threads in a warp execute the same instruction at any given point in time. When all threads in a warp execute a load instruction, the hardware detects whether they access consecutive global memory locations. That is, the most favorable access pattern is achieved when all threads in a warp access consecutive global memory locations. In this case, the hardware combines, or coalesces, all these accesses into a consolidated access to consecutive DRAM locations. For example, for a given load instruction of a warp, if thread 0 accesses global memory location N,3 thread 1 location N+1, thread 2 location N+2, and so on, all these accesses will be coalesced, or combined into a single request for consecutive locations when accessing the DRAMs. Such coalesced access allows the DRAMs to deliver data at a rate close to the peak global memory bandwidth.
To understand how to effectively use coalescing hardware, we need to review how the memory addresses are formed in accessing C multidimensional array elements. As we showed in Chapter 4 (Figure 4.3, replicated as Figure 6.6 for convenience), multidimensional array elements in C and CUDA are placed into the linearly addressed memory space according to the row-major convention. That is, the elements of row 0 of a matrix are first placed in order into consecutive locations. They are followed by the elements of row 1 of the matrix, and so on. In other words, all elements in a row are placed into consecutive locations and entire rows are placed one after another. The term row major refers to the fact that the placement of data preserves the structure of rows: all adjacent elements in a row are placed into consecutive locations in the address space. Figure 6.6 shows a small example where the 16 elements of a 4×4 matrix M are placed into linearly addressed locations. The four elements of row 0 are first placed in their order of appearance in the row. Elements in row 1 are then placed, followed by elements of row 2, followed by elements of row 3. It should be clear that M0,0 and M1,0, though they appear to be consecutive in the 2D matrix, are placed four locations away in the linearly addressed memory.
Figure 6.6 Placing matrix elements into linear order.
Figure 6.7 illustrates favorable versus unfavorable CUDA kernel 2D row-major array data access patterns for memory coalescing. Recall from Figure 4.7 that in our simple matrix–matrix multiplication kernel, each thread accesses a row of the d_M array and a column of the d_N array. Readers should review Section 4.3 before continuing. Figure 6.7(a) illustrates the data access pattern of the d_M array, where threads in a warp read adjacent rows. That is, during iteration 0, threads in a warp read element 0 of rows 0–31. During iteration 1, these same threads read element 1 of rows 0–31. None of the accesses will be coalesced. A more favorable access pattern is shown in Figure 6.7(b), where each thread reads a column of d_N. During iteration 0, threads in warp 0 read element 0 of columns 0–31. All these accesses will be coalesced.
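Schematically, the two patterns boil down to a pair of index expressions (our illustration; tid stands for a thread's position within the warp, k for the loop iteration, and Width for the matrix width):

float a = d_N[k*Width + tid]; // favorable: adjacent threads touch adjacent addresses
float b = d_M[tid*Width + k]; // unfavorable: adjacent threads are Width elements apart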
Figure 6.7 Memory access patterns in C 2D arrays for coalescing.
To understand why the pattern in Figure 6.7(b) is more favorable than that in Figure 6.7(a), we need to review how these matrix elements are accessed in more detail. Figure 6.8 shows a small example of the favorable access pattern in accessing a 4×4 matrix. The arrow in the top portion of Figure 6.8 shows the access pattern of the kernel code for one thread. This access pattern is generated by the access to d_N in Figure 4.7:
Figure 6.8 A coalesced access pattern.
Within a given iteration of the k loop, the k*Width value is the same across all threads. Recall that Col = blockIdx.x*blockDim.x + threadIdx.x. Since blockIdx.x and blockDim.x have the same value for all threads in the same block, the only part of k*Width+Col that varies across a thread block is threadIdx.x. For example, in Figure 6.8, assume that we are using 4×4 blocks and that the warp size is 4. That is, for this toy example, we are using only one block to calculate the entire P matrix. The values of Width, blockDim.x, and blockIdx.x are 4, 4, and 0, respectively, for all threads in the block. In iteration 0, the k value is 0. The index used by each thread for accessing d_N is
d_N[k*Width+Col] = d_N[k*Width + blockIdx.x*blockDim.x + threadIdx.x]
                 = d_N[0*4 + 0*4 + threadIdx.x]
                 = d_N[threadIdx.x]
That is, the index for accessing d_N is simply the value of threadIdx.x. The d_N elements accessed by T0, T1, T2, and T3 are d_N[0], d_N[1], d_N[2], and d_N[3], respectively. This is illustrated with the “Load iteration 0” box of Figure 6.8. These elements are in consecutive locations in the global memory. The hardware detects that these accesses are made by threads in a warp and to consecutive locations in the global memory. It coalesces these accesses into a consolidated access. This allows the DRAMs to supply data at a high rate.
During the next iteration, the k value is 1. The index used by each thread for accessing d_N becomes
d_N[k*Width+Col] = d_N[k*Width + blockIdx.x*blockDim.x + threadIdx.x]
                 = d_N[1*4 + 0*4 + threadIdx.x]
                 = d_N[4 + threadIdx.x]
The d_N elements accessed by T0, T1, T2, and T3 are d_N[4], d_N[5], d_N[6], and d_N[7], respectively, as shown with the “Load iteration 1” box in Figure 6.8. All these accesses are again coalesced into a consolidated access for improved DRAM bandwidth utilization.
Figure 6.9 shows an example of a matrix data access pattern that is not coalesced. The arrow in the top portion of the figure shows that the kernel code for each thread accesses elements of a row in sequence. This access pattern is generated by the access to d_M in Figure 4.7:
Figure 6.9 An uncoalesced access pattern.
d_M[Row*Width + k]
Within a given iteration of the k loop, the k value is the same across all threads. Recall that Row = blockIdx.y*blockDim.y + threadIdx.y. Since blockIdx.y and blockDim.y have the same value for all threads in the same block, the only part of Row*Width+k that can vary across a thread block is threadIdx.y. In Figure 6.9, assume again that we are using 4×4 blocks and that the warp size is 4. The values of Width, blockDim.y, and blockIdx.y are 4, 4, and 0, respectively, for all threads in the block. In iteration 0, the k value is 0. The index used by each thread for accessing d_M is
d_M[Row*Width+k] = d_M[(blockIdx.y*blockDim.y + threadIdx.y)*Width + k]
                 = d_M[(0*4 + threadIdx.y)*4 + 0]
                 = d_M[threadIdx.y*4]
That is, the index for accessing d_M is simply the value of threadIdx.y*4. The d_M elements accessed by T0, T1, T2, and T3 are d_M[0], d_M[4], d_M[8], and d_M[12]. This is illustrated with the “Load iteration 0” box of Figure 6.9. These elements are not in consecutive locations in the global memory. The hardware cannot coalesce these accesses into a consolidated access.
During the next iteration, the k value is 1. The index used by each thread for accessing d_M becomes
d_M[Row*Width+k] = d_M[(blockIdx.y*blockDim.y + threadIdx.y)*Width + k]
                 = d_M[(0*4 + threadIdx.y)*4 + 1]
                 = d_M[threadIdx.y*4 + 1]
The d_M elements accessed by T0, T1, T2, T3 are d_M[1], d_M[5], d_M[9], and d_M[13], respectively, as shown with the “Load iteration 1” box in Figure 6.9. All these accesses again cannot be coalesced into a consolidated access.
For a realistic matrix, there are typically hundreds or even thousands of elements in each dimension. The elements accessed in each iteration by neighboring threads can be hundreds or even thousands of elements apart. The “Load iteration 0” box in the bottom portion shows how the threads access these nonconsecutive locations in iteration 0. The hardware will determine that accesses to these elements are far away from each other and cannot be coalesced. As a result, when a kernel loop iterates through a row, the accesses to global memory are much less efficient than in the case where a kernel iterates through a column.
If an algorithm intrinsically requires a kernel code to iterate through data along the row direction, one can use the shared memory to enable memory coalescing. The technique is illustrated in Figure 6.10 for matrix multiplication. Each thread reads a row from d_M, a pattern that cannot be coalesced. Fortunately, a tiled algorithm can be used to enable coalescing. As we discussed in Chapter 5, threads of a block can first cooperatively load the tiles into the shared memory. Care must be taken to ensure that these tiles are loaded in a coalesced pattern. Once the data is in shared memory, it can be accessed either on a row basis or a column basis with much less performance variation because the shared memories are implemented as intrinsically high-speed, on-chip memory that does not require coalescing to achieve a high data access rate.
Figure 6.10 Using shared memory to enable coalescing.
We replicate Figure 5.7 here as Figure 6.11, where the matrix multiplication kernel loads tiles of the matrices d_M and d_N into the shared memory. Note that each thread in a thread block is responsible for loading one d_M element and one d_N element into Mds and Nds in each phase, as defined by the for loop in line 8. Recall that there are TILE_WIDTH^2 threads involved in each tile. The threads use threadIdx.y and threadIdx.x to determine the element of each matrix to load.
Figure 6.11 Tiled matrix multiplication kernel using shared memory.
The d_M elements are loaded in line 9, where the index calculation for each thread uses m to locate the left end of the tile. Each row of the tile is then loaded by TILE_WIDTH threads whose threadIdx values differ in the x dimension. Since these threads have consecutive threadIdx.x values, they are in the same warp. Also, the index calculation d_M[Row][m*TILE_WIDTH+tx] makes these threads access elements in the same row. The question is whether adjacent threads in the warp indeed access adjacent elements in the row. Recall that elements in the same row are placed into consecutive locations of the global memory. Since the column index m*TILE_WIDTH+tx is such that all threads with adjacent tx values will access adjacent row elements, the answer is yes. The hardware detects that these threads in the same warp access consecutive locations in the global memory and combines them into a coalesced access.
In the case of d_N, the row index m*TILE_WIDTH+ty has the same value for all threads in the same warp; they all have the same ty value. Thus, threads in the same warp access the same row. The question is whether the adjacent threads in a warp access adjacent elements of that row. Note that the column index calculation for each thread, Col, is based on bx*TILE_WIDTH+tx (see line 4). Therefore, adjacent threads in a warp access adjacent elements in a row. The hardware detects that these threads in the same warp access consecutive locations in the global memory and combines them into a coalesced access.
Readers will find it useful to draw a picture based on the kernel code in Figure 6.11 and identify the threadIdx.y and threadIdx.x values of the thread that loads each element of the tile. Lines 5, 6, 9, and 10 in Figure 6.11 form a frequently used programming pattern for loading matrix elements into shared memory in tiled algorithms. We would also like to encourage readers to analyze the data access pattern of the dot-product loop in lines 12 and 13. Note that the threads in a warp do not access consecutive locations of Mds. This is not a problem, since Mds is in shared memory, which does not require coalescing to achieve high-speed data access.
The execution resources in a streaming multiprocessor (SM) include registers, shared memory, thread block slots, and thread slots. These resources are dynamically partitioned and assigned to threads to support their execution. In Chapter 4, we saw that the current generation of devices has 1,536 thread slots, each of which can accommodate one thread. These thread slots are partitioned and assigned to thread blocks at runtime. If each thread block consists of 512 threads, the 1,536 thread slots are partitioned and assigned to three blocks. In this case, each SM can accommodate up to three thread blocks due to the limitation on thread slots. If each thread block contains 128 threads, the 1,536 thread slots are partitioned and assigned to 12 thread blocks. The ability to dynamically partition the thread slots among thread blocks makes SMs versatile. They can either execute many thread blocks each having few threads, or execute few thread blocks each having many threads. This is in contrast to a fixed partitioning method in which each block receives a fixed amount of resources regardless of its real needs. Fixed partitioning results in wasted thread slots when a block has few threads, and fails to support blocks that require more thread slots than the fixed partition allows.
Dynamic partitioning of resources can lead to subtle interactions between resource limitations, which can cause underutilization of resources. Such interactions can occur between block slots and thread slots. For example, if each block has 128 threads, the 1,536 thread slots can be partitioned and assigned to 12 blocks. However, since there are only 8 block slots in each SM, only 8 blocks will be allowed. This means that only 1,024 of the thread slots will be utilized. Therefore, to fully utilize both the block slots and thread slots, one needs at least 256 threads in each block.
As we mentioned in Chapter 4, the automatic variables declared in a CUDA kernel are placed into registers. Some kernels may use lots of automatic variables and others may use few of them. Thus, one should expect that some kernels require many registers and some require fewer. By dynamically partitioning the registers among blocks, the SM can accommodate more blocks if they require few registers and fewer blocks if they require more registers. One does, however, need to be aware of potential interactions between register limitations and other resource limitations.
In the matrix multiplication example, assume that each SM has 16,384 registers and the kernel code uses 10 registers per thread. If we have 16×16 thread blocks, how many threads can run on each SM? We can answer this question by first calculating the number of registers needed for each block, which is 10×16×16=2,560. The number of registers required by six blocks is 15,360, which is under the 16,384 limit. Adding another block would require 17,920 registers, which exceeds the limit. Therefore, the register limitation allows blocks that altogether have 1,536 threads to run on each SM, which also fits within the limit of block slots and 1,536 thread slots.
Now assume that the programmer declares two more automatic variables in the kernel, bumping the number of registers used by each thread to 12. Assuming the same 16×16 blocks, each block now requires 12×16×16=3,072 registers. The number of registers required by six blocks is now 18,432, which exceeds the register limitation. The CUDA runtime system deals with this situation by reducing the number of blocks assigned to each SM by one, thus reducing the number of registers required to 15,360. This, however, reduces the number of threads running on an SM from 1,536 to 1,280. That is, by using two extra automatic variables, the program saw a one-sixth reduction in the warp parallelism in each SM. This is sometimes referred to as a “performance cliff,” where a slight increase in resource usage can result in a significant reduction in parallelism and performance [RRS2008]. Readers are referred to the CUDA Occupancy Calculator [NVIDIA], a downloadable Excel sheet that calculates the actual number of threads running on each SM for a particular device implementation given a kernel’s resource usage.
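The same style of calculation can be extended with the register limit to reproduce the performance cliff just described. The sketch below uses the numbers assumed in the text (16,384 registers, 8 block slots, and 1,536 thread slots per SM); it illustrates the arithmetic and is not a replacement for the Occupancy Calculator:

#include <stdio.h>

static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Resident blocks per SM once the register limit is also considered. */
int resident_blocks(int threads_per_block, int regs_per_thread)
{
    const int REGS_PER_SM = 16384, BLOCK_SLOTS = 8, THREAD_SLOTS = 1536;
    int by_regs    = REGS_PER_SM / (regs_per_thread * threads_per_block);
    int by_threads = THREAD_SLOTS / threads_per_block;
    return min3(by_regs, by_threads, BLOCK_SLOTS);
}

int main(void)
{
    /* 16x16 = 256-thread blocks: 10 registers/thread allows 6 blocks
       (1,536 threads), but 12 registers/thread allows only 5 blocks
       (1,280 threads) -- the performance cliff. */
    printf("10 regs: %d blocks\n", resident_blocks(256, 10)); /* 6 */
    printf("12 regs: %d blocks\n", resident_blocks(256, 12)); /* 5 */
    return 0;
}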
An important algorithmic decision in performance tuning is the granularity of threads. It is often advantageous to put more work into each thread and use fewer threads. Such advantage arises when some redundant work exists between threads. In the current generation of devices, each SM has limited instruction processing bandwidth. Every instruction consumes instruction processing bandwidth, whether it is a floating-point calculation instruction, a load instruction, or a branch instruction. Eliminating redundant instructions can ease the pressure on the instruction processing bandwidth and improve the overall execution speed of the kernel.
Figure 6.12 illustrates such an opportunity in matrix multiplication. The tiled algorithm in Figure 6.11 uses one thread to compute one element of the output d_P matrix. This requires a dot product between one row of d_M and one column of d_N.
Figure 6.12 Increased thread granularity with rectangular tiles.
The opportunity of thread granularity adjustment comes from the fact that multiple blocks redundantly load each d_M tile. As shown in Figure 6.12, the calculation of two d_P elements in adjacent tiles uses the same d_M row. With the original tiled algorithm, the same d_M row is redundantly loaded by the two blocks assigned to generate these two d_P tiles. One can eliminate this redundancy by merging the two thread blocks into one. Each thread in the new thread block now calculates two d_P elements. This is done by revising the kernel so that two dot products are computed by the innermost loop of the kernel. Both dot products use the same Mds row but different Nds columns. This reduces the global memory access by one-quarter. Readers are encouraged to write the new kernel as an exercise.
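For readers who want a preview before attempting the exercise, the following is one possible sketch of such a kernel. It assumes square Width×Width matrices with Width a multiple of TILE_WIDTH and a granularity factor of two; the names matrixMulCoarsened, Nds0, Nds1, Pvalue0, and Pvalue1 are our own, not from the original listing:

#define TILE_WIDTH 16

__global__ void matrixMulCoarsened(float *d_M, float *d_N, float *d_P, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds0[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds1[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row  = blockIdx.y * TILE_WIDTH + ty;
    int Col0 = (2 * blockIdx.x) * TILE_WIDTH + tx;   /* first output tile */
    int Col1 = Col0 + TILE_WIDTH;                    /* adjacent output tile */

    float Pvalue0 = 0.0f, Pvalue1 = 0.0f;
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        /* One load of the d_M tile now serves both output tiles. */
        Mds[ty][tx]  = d_M[Row * Width + m * TILE_WIDTH + tx];
        Nds0[ty][tx] = d_N[(m * TILE_WIDTH + ty) * Width + Col0];
        Nds1[ty][tx] = d_N[(m * TILE_WIDTH + ty) * Width + Col1];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k) {
            float r = Mds[ty][k];        /* reused by both dot products */
            Pvalue0 += r * Nds0[k][tx];
            Pvalue1 += r * Nds1[k][tx];
        }
        __syncthreads();
    }
    d_P[Row * Width + Col0] = Pvalue0;
    d_P[Row * Width + Col1] = Pvalue1;
}

The grid would be launched with half as many blocks in the x dimension, e.g., dim3 grid(Width/(2*TILE_WIDTH), Width/TILE_WIDTH) and dim3 block(TILE_WIDTH, TILE_WIDTH).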
The potential downside is that the new kernel now uses even more registers and shared memory. As we discussed in the previous section, the number of blocks that can be running on each SM may decrease. It also reduces the total number of thread blocks by half, which may result in an insufficient amount of parallelism for matrices of smaller dimensions. In practice, we found that combining up to four adjacent horizontal blocks to compute adjacent horizontal tiles improves the performance of large (2,048×2,048 or more) matrix multiplication.
In this chapter, we reviewed the major aspects of CUDA C application performance on a CUDA device: control flow divergence, global memory coalescing, dynamic resource partitioning, and instruction mixes. We presented practical techniques for creating good program patterns for these performance aspects. We will continue to study practical applications of these techniques in the case studies in the next few chapters.
6.1 The kernels in Figures 6.2 and 6.4 are wasteful in their use of threads; half of the threads in each block never execute. Modify the kernels to eliminate such waste. Give the relevant execution configuration parameter values at the kernel launch. Is there a cost in terms of an extra arithmetic operation needed? Which resource limitation can be potentially addressed with such a modification? (Hint: line 2 and/or line 4 can be adjusted in each case; the number of elements in the section may increase.)
6.2 Compare the modified kernels you wrote for Exercise 6.1. Which modification introduced fewer additional arithmetic operations?
6.3 Write a complete kernel based on Exercise 6.1 by (1) adding the statements that load a section of the input array from global memory to shared memory, (2) using blockIdx.x to allow multiple blocks to work on different sections of the input array, and (3) writing the reduction value for the section to a location according to the blockIdx.x so that all blocks will deposit their section reduction value to the lower part of the input array in global memory.
6.4 Design a reduction program based on the kernel you wrote for Exercise 6.3. The host code should (1) transfer a large input array to the global memory, and (2) use a loop to repeatedly invoke the kernel you wrote for Exercise 6.3 with adjusted execution configuration parameter values so that the reduction result for the input array will eventually be produced.
6.5 For the matrix multiplication kernel in Figure 6.11, draw the access patterns of the threads in a warp for lines 9 and 10, assuming a small 16×16 matrix size. Calculate the tx and ty values for each thread in a warp and use these values in the d_M and d_N index calculations in lines 9 and 10. Show that the threads indeed access consecutive d_M and d_N locations in global memory during each iteration.
6.6 For the simple matrix–matrix multiplication (M × N) based on row-major layout, which input matrix will have coalesced accesses?
6.7 For the tiled matrix–matrix multiplication (M × N) based on row-major layout, which input matrix will have coalesced accesses?
6.8 For the simple reduction kernel, if the block size is 1,024 and warp size is 32, how many warps in a block will have divergence during the fifth iteration?
6.9 For the improved reduction kernel, if the block size is 1,024 and warp size is 32, how many warps will have divergence during the fifth iteration?
6.10 Write a matrix multiplication kernel function that corresponds to the design illustrated in Figure 6.12.
6.11 The following scalar product code tests your understanding of the basic CUDA model. The following code computes 1,024 dot products, each of which is calculated from a pair of 256-element vectors. Assume that the code is executed on G80. Use the code to answer the following questions.
1 #define VECTOR_N 1024
2 #define ELEMENT_N 256
3 const int DATA_N = VECTOR_N * ELEMENT_N;
4 const int DATA_SZ = DATA_N * sizeof(float);
5 const int RESULT_SZ = VECTOR_N * sizeof(float);
…
6 float *d_A, *d_B, *d_C;
…
7 cudaMalloc((void **)&d_A, DATA_SZ);
8 cudaMalloc((void **)&d_B, DATA_SZ);
9 cudaMalloc((void **)&d_C, RESULT_SZ);
…
10 scalarProd<<<VECTOR_N, ELEMENT_N>>>(d_C, d_A, d_B, ELEMENT_N);
11
12 __global__ void
13 scalarProd(float *d_C, float *d_A, float *d_B, int ElementN)
14 {
15   __shared__ float accumResult[ELEMENT_N];
16   // Current vectors' bases
17   float *A = d_A + ElementN * blockIdx.x;
18   float *B = d_B + ElementN * blockIdx.x;
19   int tx = threadIdx.x;
20
21   accumResult[tx] = A[tx] * B[tx];
22
23   for (int stride = ElementN / 2; stride > 0; stride >>= 1)
24   {
25     __syncthreads();
26     if (tx < stride)
27       accumResult[tx] += accumResult[stride + tx];
28   }
30   d_C[blockIdx.x] = accumResult[0];
31 }
a. How many threads are there in total?
b. How many threads are there in a warp?
c. How many threads are there in a block?
d. How many global memory loads and stores are done for each thread?
e. How many accesses to shared memory are done for each block?
f. List the source code lines, if any, that cause shared memory bank conflicts.
g. How many iterations of the for loop (line 23) will have branch divergence? Show your derivation.
h. Identify an opportunity to significantly reduce the bandwidth requirement on the global memory. How would you achieve this? How many accesses can you eliminate?
6.12 In Exercise 4.2, out of the possible range of values for BLOCK_SIZE, for what values of BLOCK_SIZE will the kernel completely avoid uncoalesced accesses to global memory?
6.13 In an attempt to improve performance, a bright young engineer changed the CUDA code in Figure 6.4 into the following.
extern __shared__ float partialSum[];
unsigned int tid = threadIdx.x;
for (unsigned int stride = n>>1; stride >= 32; stride >>= 1) {
    __syncthreads();
    if (tid < stride)
        partialSum[tid] += partialSum[tid + stride];
}
__syncthreads();
if (tid < 32) { // unroll last 5 predicated steps
    partialSum[tid] += partialSum[tid + 16];
    partialSum[tid] += partialSum[tid + 8];
    partialSum[tid] += partialSum[tid + 4];
    partialSum[tid] += partialSum[tid + 2];
    partialSum[tid] += partialSum[tid + 1];
}
a. Do you believe that the performance will be improved? Why or why not?
1. CUDA Occupancy Calculator.
2. CUDA C Best Practices Guide, v. 4.2, 2012.
3. Ryoo, S., Rodrigues, C., Stone, S., Baghsorkhi, S., Ueng, S., Stratton, J., & Hwu, W. Program optimization space pruning for a multithreaded GPU, Proceedings of the 6th ACM/IEEE International Symposium on Code Generation and Optimization, April 6–9, 2008.
4. Ryoo, S., Rodrigues, C. I., Baghsorkhi, S. S., Stone, S. S., Kirk, D. B., & Hwu, W. W. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA, Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2008.
1. Note that using the same number of threads as the number of elements in a section is wasteful. Half of the threads in a block will never execute. Readers are encouraged to modify the kernel and the kernel launch execution configuration parameters to eliminate this waste (see Exercise 6.1).
2. Recent CUDA devices use on-chip caches for global memory data. Such caches automatically coalesce more of the kernel access patterns and somewhat reduce the need for programmers to manually rearrange their access patterns. However, even with caches, coalescing techniques will continue to have a significant effect on kernel execution performance in the foreseeable future.
3. Different CUDA devices may also impose alignment requirements on N. For example, in some CUDA devices, N is required to be aligned to 16-word boundaries. That is, the lower 6 bits of N should all be 0 bits. We will discuss techniques that address this alignment requirement in Chapter 12.
7.1 Floating-Point Format
7.2 Representable Numbers
7.3 Special Bit Patterns and Precision in IEEE Format
7.4 Arithmetic Accuracy and Rounding
7.5 Algorithm Considerations
7.6 Numerical Stability
7.7 Summary
7.8 Exercises
In the early days of computing, floating-point arithmetic capability was found only in mainframes and supercomputers. Although many microprocessors designed in the 1980s started to have floating-point coprocessors, their floating-point arithmetic speed was extremely slow, about three orders of magnitude slower than that of mainframes and supercomputers. With advances in microprocessor technology, many microprocessors designed in the 1990s, such as the Intel Pentium III and AMD Athlon, started to have high-performance floating-point capabilities that rivaled those of supercomputers. High-speed floating-point arithmetic has become a standard feature for microprocessors and GPUs today. As a result, it has also become important for application programmers to understand and take advantage of floating-point arithmetic in developing their applications. In particular, we will focus on the accuracy of floating-point arithmetic, the precision of floating-point number representation, and how they should be taken into consideration in parallel computing.
The IEEE-754 Floating-Point Standard is an effort by computer manufacturers to conform to a common representation and arithmetic behavior for floating-point data [IEEE2008]. Most, if not all, of the computer manufacturers in the world have accepted this standard. In particular, virtually all microprocessors designed in the future will either fully or almost fully conform to the IEEE-754 Floating-Point Standard and its more recent IEEE-754 2008 revision [IEEE2008]. Therefore, it is important for application developers to understand the concept and practical considerations of this standard.
A floating-point number system starts with the representation of a numerical value as bit patterns. In the IEEE-754 Floating-Point Standard, a numerical value is represented in three groups of bits: sign (S), exponent (E), and mantissa (M). With some exceptions that will be detailed later, each (S, E, M) pattern uniquely identifies a numeric value according to the following formula:
value = (−1)^S × 1.M × 2^E    (7.1)
The interpretation of S is simple: S=0 means a positive number and S=1 a negative number. Mathematically, any number, including −1, when raised to the power of 0, results in 1. Thus, the value is positive. On the other hand, when −1 is raised to the power of 1, it is −1 itself. With a multiplication by −1, the value becomes negative. The interpretation of the M and E bits is, however, much more complex. We will use the following example to help explain the interpretation of the M and E bits.
Assume for the sake of simplicity that each floating-point number consists of a 1-bit sign, 3-bit exponent, and 2-bit mantissa. We will use this hypothetical 6-bit format to illustrate the challenges involved in encoding E and M. As we discuss numeric values, we will sometimes need to express a number either in decimal place value or in binary place value. Numbers expressed in decimal place value will have subscript D and those in binary place value will have subscript B. For example, 0.5D (5×10^−1, since the place to the right of the decimal point carries a weight of 10^−1) is the same as 0.1B (1×2^−1, since the place to the right of the binary point carries a weight of 2^−1).
Equation (7.1) requires that all values be derived by treating the mantissa value as 1.M, which makes the mantissa bit pattern for each floating-point number unique. For example, the only mantissa bit pattern allowed for 0.5D is the one where all bits that represent M are 0's:
0.5D = 1.00B × 2^−1
Other potential candidates would be 0.1B×2^0 and 10.0B×2^−2, but neither fits the form of 1.M. The numbers that satisfy this restriction will be referred to as normalized numbers. Because all mantissa values that satisfy the restriction are of the form 1.XX, we can omit the “1.” part from the representation. Therefore, the mantissa value of 0.5 in a 2-bit mantissa representation is 00, which is derived by omitting “1.” from 1.00. This makes a 2-bit mantissa effectively a 3-bit mantissa. In general, with IEEE format, an m-bit mantissa is effectively an (m + 1)-bit mantissa.
The number of bits used to represent E determines the range of numbers that can be represented. Large positive E values result in very large floating-point absolute values. For example, if the value of E is 64, the floating-point number being represented is between 2^64 (> 10^18) and 2^65. You would be extremely happy if this was the balance of your savings account! Large negative E values result in very small floating-point values. For example, if the E value is −64, the number being represented is between 2^−64 (< 10^−18) and 2^−63. This is a very tiny fractional number. The E field allows a floating-point number format to represent a wider range of numbers than integer number formats. We will come back to this point when we look at the representable numbers of a format.
The IEEE standard adopts an excess or biased encoding convention for E. If e bits are used to represent the exponent E, (2^(e−1) − 1) is added to the 2's complement representation of the exponent to form its excess representation. A 2's complement representation is a system where the negative value of a number can be derived by first complementing every bit of the value and then adding 1 to the result. In our 3-bit exponent representation, there are 3 bits in the exponent (e=3). Therefore, the value 2^(3−1) − 1 = 3 (011B) will be added to the 2's complement representation of the exponent value.
The advantage of excess representation is that an unsigned comparator can be used to compare signed numbers. As shown in Figure 7.1, in our 3-bit exponent representation, the excess-3 bit patterns increase monotonically from −3 to 3 when viewed as unsigned numbers. We will refer to each of these bit patterns as the code for the corresponding value. For example, the code for −3 is 000 and that for 3 is 110. Thus, if one uses an unsigned number comparator to compare the excess-3 codes for any numbers from −3 to 3, the comparator gives the correct comparison result in terms of which number is larger, smaller, and so on. As another example, if one compares excess-3 codes 001 and 100 with an unsigned comparator, 001 is smaller than 100. This is the right conclusion, since the values that they represent, −2 and 1, have exactly the same relation. This is a desirable property for hardware implementation, since unsigned comparators are smaller and faster than signed comparators.
Figure 7.1 Excess-3 encoding, sorted by excess-3 ordering.
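The comparison property is easy to verify in code. The small C sketch below (our own demonstration, not from the original text) encodes an exponent in [−3, 3] as an excess-3 code and checks that unsigned comparison of codes matches the signed ordering:

#include <assert.h>

/* Excess-3 code: stored bits = signed exponent + bias, bias = 2^(3-1) - 1. */
unsigned excess3_code(int exponent)        /* exponent in [-3, 3] */
{
    return (unsigned)(exponent + 3);       /* -3 -> 000, 0 -> 011, 3 -> 110 */
}

int main(void)
{
    /* Codes compare, as unsigned integers, in the same order as the signed
       exponents they represent: code(-2) = 001 < code(1) = 100. */
    assert(excess3_code(-2) < excess3_code(1));
    assert(excess3_code(-3) == 0u && excess3_code(3) == 6u);
    return 0;
}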
Figure 7.1 also shows that the pattern of all 1's in the excess representation is a reserved pattern. Note that a 0 value plus an equal number of positive and negative values results in an odd number of patterns, while the 3-bit field provides an even number of them. Assigning the leftover pattern 111 to either a positive or a negative value would make the counts of positive and negative patterns unbalanced, so it is reserved instead. The IEEE standard uses this special bit pattern in special ways that will be discussed later.
Now we are ready to represent 0.5D with our 6-bit format:
0.5D = 1.00B × 2^−1, which gives S = 0, E = 010 (the excess-3 code for −1), and M = 00
That is, the 6-bit representation for 0.5D is 001000.
In general, with a normalized mantissa and excess-coded exponent, the value of a number with an n-bit exponent is
value = (−1)^S × 1.M × 2^(E − (2^(n−1) − 1))
The representable numbers of a number format are the numbers that can be exactly represented in the format. For example, if one uses a 3-bit unsigned integer format, the representable numbers are shown in Figure 7.2.
Figure 7.2 Representable numbers of a 3-bit unsigned integer format.
Neither −1 nor 9 can be represented in the format given in Figure 7.2. We can draw a number line to identify all the representable numbers, as shown in Figure 7.3 where all representable numbers of the 3-bit unsigned integer format are marked with stars.
Figure 7.3 Representable numbers of a 3-bit unsigned integer format.
The representable numbers of a floating-point format can be visualized in a similar manner. In Figure 7.4, we show all the representable numbers of the format we have discussed so far, along with two variations. We use a 5-bit format to keep the size of the table manageable. The format consists of 1-bit S, 2-bit E (excess-1 coded), and 2-bit M (with the “1.” part omitted). The no-zero column gives the representable numbers of the format we have discussed thus far. Readers are encouraged to generate at least part of the no-zero column based on the formula given in Section 7.1. Note that with this format, 0 is not one of the representable numbers.
Figure 7.4 Representable numbers of no-zero, abrupt underflow, and denorm formats.
A quick look at how these representable numbers populate the number line, as shown in Figure 7.5, provides further insights about these representable numbers. In Figure 7.5, we show only the positive representable numbers. The negative numbers are symmetric to their positive counterparts on the other side of 0.
Figure 7.5 Representable numbers of the no-zero representation.
We can make five observations. First, the exponent bits define the major intervals of representable numbers. In Figure 7.5, there are three major intervals on each side of 0 because there are two exponent bits. Basically, the major intervals are between powers of 2. With 2 bits of exponent and one reserved bit pattern (11), there are three powers of 2 (2^−1 = 0.5D, 2^0 = 1.0D, 2^1 = 2.0D), and each starts an interval of representable numbers. Keep in mind that there are also three powers of 2 (−2^−1 = −0.5D, −2^0 = −1.0D, −2^1 = −2.0D) to the left of 0 that are not shown in Figure 7.5.
The second observation is that the mantissa bits define the number of representable numbers in each interval. With two mantissa bits, we have four representable numbers in each interval. In general, with N mantissa bits, we have 2^N representable numbers in each interval. If a value to be represented falls within one of the intervals, it will be rounded to one of these representable numbers. Obviously, the larger the number of representable numbers in each interval, the more precisely we can represent a value in the region. Therefore, the number of mantissa bits determines the precision of the representation.
The third observation is that 0 is not representable in this format. It is missing from the representable numbers in the no-zero column of Figure 7.5. Because 0 is one of the most important numbers, not being able to represent 0 in a number representation system is a serious deficiency. We will address this deficiency soon.
The fourth observation is that the representable numbers become closer to each other toward the neighborhood of 0. Each interval is half the size of the previous interval as we move toward 0. In Figure 7.5, the rightmost interval is of width 2, the next one is of width 1, and the next one is of width 0.5. While not shown in Figure 7.5, there are three intervals to the left of 0. They contain the representable negative numbers. The leftmost interval is of width 2, the next one is of width 1, and the next one is of width 0.5. Since every interval has the same number of representable numbers, four in Figure 7.5, the representable numbers become closer to each other as we move toward 0. In other words, the representable numbers become closer as their absolute values become smaller. This is a desirable trend, because as the absolute value of these numbers becomes smaller, it is more important to represent them precisely. The distance between representable numbers determines the maximal rounding error for a value that falls into the interval. For example, if you have one billion dollars in your bank account, you may not even notice a 1 dollar rounding error in calculating your balance. However, if the total balance is 10 dollars, a 1 dollar rounding error would be much more noticeable!
The fifth observation is that, unfortunately, the trend of increasing density of representable numbers, and thus increasing precision, as we move toward 0 does not hold in the very vicinity of 0. That is, there is a gap of representable numbers in the immediate vicinity of 0. This is because the range of the normalized mantissa precludes 0. This is another serious deficiency. The representation introduces significantly larger (4×) errors when representing numbers between 0 and 0.5 compared to the errors for the larger numbers between 0.5 and 1.0. In general, with m bits in the mantissa, this style of representation would introduce 2^m times more error in the interval closest to 0 than in the next interval. For numerical methods that rely on accurate detection of convergence conditions based on very small data values, such a deficiency can cause instability in execution time and accuracy of results. Furthermore, some algorithms generate small numbers and eventually use them as denominators. The errors in representing these small numbers can be greatly magnified in the division process and cause numerical instability in these algorithms.
One method that can accommodate 0 in a normalized floating-point number system is the abrupt underflow convention, which is illustrated in the second column of Figure 7.4. Whenever E is 0, the number is interpreted as 0. In our 5-bit format, this method takes away eight representable numbers (four positive and four negative) in the vicinity of 0 (between −1.0 and +1.0) and makes them all 0. Due to its simplicity, some minicomputers in the 1980s used abrupt underflow. Even to this day, some arithmetic units that need to operate at high speed still use the abrupt underflow convention. Although this method makes 0 a representable number, it creates an even larger gap between representable numbers in 0's vicinity, as shown in Figure 7.6. It is obvious, when compared with Figure 7.5, that the gap between representable numbers has been enlarged significantly (by 2×, from 0.5 to 1.0). As we explained before, this is very problematic for many numerical algorithms whose correctness relies on the accurate representation of small numbers near 0.
Figure 7.6 Representable numbers of the abrupt underflow format.
The actual method adopted by the IEEE standard is called denormalization. The method relaxes the normalization requirement for numbers very close to 0. As shown later in Figure 7.8, whenever E=0, the mantissa is no longer assumed to be of the form 1.XX. Rather, it is assumed to be 0.XX. The value of the exponent is assumed to be the same as that of the previous interval. For example, in Figure 7.4, the denormalized representation 00001 has exponent value 00 and mantissa value 01. The mantissa is assumed to be 0.01 and the exponent value is assumed to be the same as that of the previous interval: 0 rather than −1. That is, the value that 00001 represents is now 0.01B×2^0 = 2^−2. Figure 7.7 shows the representable numbers for the denormalized format. The representation now has uniformly spaced representable numbers in the close vicinity of 0. Intuitively, the denormalized convention takes the four numbers in the last interval of representable numbers of a no-zero representation and spreads them out to cover the gap area. This eliminates the undesirable gap in the previous two methods. Note that the distances between representable numbers in the last two intervals are actually identical. In general, if the n-bit exponent is 0, the value is 0.M × 2^(−2^(n−1) + 2).
Figure 7.7 Representable numbers of a denormalization format.
As we can see, the denormalization formula is quite complex. The hardware also needs to be able to detect whether a number falls into the denormalized interval and choose the appropriate representation for that number. The amount of hardware required to implement denormalization at high speed is quite significant. Implementations that use a moderate amount of hardware often introduce thousands of clock cycles of delay whenever a denormalized number needs to be generated or used. This was the reason why early generations of CUDA devices did not support denormalization. However, virtually all recent generations of CUDA devices, thanks to the increasing number of transistors available in more recent fabrication processes, support denormalization. More specifically, all CUDA devices of compute capability 1.3 and later support denormalized double-precision operands, and all devices of compute capability 2.0 and later support denormalized single-precision operands.
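The denormalized convention can be made concrete with a small decoder for the hypothetical 5-bit format used in this section (1-bit S, 2-bit excess-1 E, 2-bit M). This is illustration code of our own, not the hardware algorithm:

#include <math.h>
#include <stdio.h>

float decode5(unsigned s, unsigned e, unsigned m)
{
    int bias = 1;                          /* excess-1 for a 2-bit exponent */
    float sign = s ? -1.0f : 1.0f;
    if (e == 0)                            /* denormalized: 0.M x 2^(1-bias) */
        return sign * (m / 4.0f) * powf(2.0f, 1 - bias);
    return sign * (1.0f + m / 4.0f) * powf(2.0f, (int)e - bias);
}

int main(void)
{
    printf("%g\n", decode5(0, 0, 1));      /* 00001 -> 0.01B x 2^0 = 0.25 */
    printf("%g\n", decode5(0, 0, 0));      /* 0 is now representable */
    printf("%g\n", decode5(0, 2, 0));      /* 1.00B x 2^1 = 2 */
    return 0;
}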
In summary, the precision of a floating-point representation is measured by the maximal error that we can introduce to a floating-point number by representing that number as one of the representable numbers. The smaller the error is, the higher the precision. The precision of a floating-point representation can be improved by adding more bits to mantissa. Adding 1 bit to the representation of the mantissa improves the precision by reducing the maximal error by half. Thus, a number system has higher precision when it uses more bits for mantissa. This is reflected in double-precision versus single-precision numbers in the IEEE standard.
We now turn to more specific details of the actual IEEE format. When all exponent bits are 1's, the number represented is an infinity value if the mantissa is 0. It is “not a number” (NaN) if the mantissa is not 0. All special bit patterns of the IEEE floating-point format are described in Figure 7.8.
Figure 7.8 Special bit patterns in the IEEE standard format.
All other numbers are normalized floating-point numbers. Single-precision numbers have 1-bit S, 8-bit E, and 23-bit M. Double-precision numbers have 1-bit S, 11-bit E, and 52-bit M. Since a double-precision number has 29 more bits for mantissa, the largest error for representing a number is reduced to 1/2^29 of that of the single-precision format! With the additional 3 bits of exponent, the double-precision format also extends the number of intervals of representable numbers. This extends the range of representable numbers to very large as well as very small values.
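These fields are easy to inspect in a real IEEE single-precision number. The following short C program (a demonstration of our own, using only standard C) copies a float's bits into an integer and pulls the three fields apart:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = 0.5f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);        /* reinterpret, do not convert */

    uint32_t s = bits >> 31;               /* 1 sign bit */
    uint32_t e = (bits >> 23) & 0xFF;      /* 8 exponent bits, excess-127 */
    uint32_t m = bits & 0x7FFFFF;          /* 23 mantissa bits, "1." implied */

    /* 0.5 = 1.0 x 2^-1: S = 0, E = -1 + 127 = 126, M = 0. */
    printf("S=%u E=%u M=0x%06X\n", s, e, m);
    return 0;
}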
All representable numbers fall between −∞ (negative infinity) and +∞ (positive infinity). An ∞ can be created by overflow, for example, a large number divided by a very small number. Any representable number divided by +∞ or −∞ results in 0.
NaN is generated by operations of which the input values do not make sense, for example, 0/0, 0×∞, ∞/∞, ∞−∞. They are also used for data that has not been properly initialized in a program. There are two types of NaNs in the IEEE standard: signaling and quiet. Signaling NaNs (sNaNs) should be represented with the most significant mantissa bit cleared, whereas quiet NaNs (qNaNs) are represented with the most significant mantissa bit set.
An sNaN causes an exception when used as input to arithmetic operations. For example, the operation (1.0 + sNaN) raises an exception signal to the operating system. Signaling NaNs are used in situations where the programmer would like to make sure that the program execution be interrupted whenever any NaN values are used in floating-point computations. These situations usually mean that there is something wrong with the execution of the program. In mission-critical applications, the execution cannot continue until the validity of the execution can be verified with a separate means. For example, software engineers often mark all the uninitialized data as sNaN. This practice ensures the detection of using uninitialized data during program execution. The current generation of GPU hardware does not support sNaN. This is due to the difficulty of supporting accurate signaling during massively parallel execution.
A qNaN generates another qNaN without causing an exception when used as input to arithmetic operations. For example, the operation (1.0 + qNaN) generates a qNaN. Quiet NaNs are typically used in applications where the user can review the output and decide if the application should be rerun with a different input for more valid results. When the results are printed, qNaNs are printed as “NaN” so that the user can spot them in the output file easily.
Now that we have a good understanding of the IEEE floating-point format, we are ready to discuss the concept of arithmetic accuracy. While precision is determined by the number of mantissa bits used in a floating-point number format, accuracy is determined by the operations performed on floating-point numbers. The accuracy of a floating-point arithmetic operation is measured by the maximal error introduced by the operation. The smaller the error, the higher the accuracy. The most common source of error in floating-point arithmetic is when the operation generates a result that cannot be exactly represented and thus requires rounding. Rounding occurs if the mantissa of the result value needs too many bits to be represented exactly. For example, a multiplication generates a product value that consists of twice as many bits as either of the input values. As another example, adding two floating-point numbers can be done by adding their mantissa values together if the two floating-point values have identical exponents. When two input operands to a floating-point addition have different exponents, the mantissa of the one with the smaller exponent is repeatedly divided by 2, or right-shifted (i.e., all the mantissa bits are shifted to the right by 1 bit position), until the exponents are equal. As a result, the final result can have more bits than the format can accommodate.
Alignment shifting of operands can be illustrated with a simple example based on the 5-bit representation in Figure 7.4. Assume that we need to add 1.00B×2^−2 (0, 00, 01) to 1.00B×2^1 (0, 10, 00); that is, we need to perform 1.00B×2^1 + 1.00B×2^−2. Due to the difference in exponent values, the mantissa value of the second number needs to be right-shifted by 3 bit positions before it is added to the first mantissa value. That is, the addition becomes 1.00B×2^1 + 0.001B×2^1. The addition can now be performed by adding the mantissa values together. The ideal result would be 1.001B×2^1. However, we can see that this ideal result is not a representable number in a 5-bit representation. It would have required three mantissa bits, and there are only two mantissa bits in the format. Thus, the best one can do is to generate one of the closest representable numbers, which is either 1.01B×2^1 or 1.00B×2^1. By doing so, we introduce an error, 0.001B×2^1, which is half the place value of the least significant place. We refer to this as 0.5D ULP (units in the last place). If the hardware is designed to perform arithmetic and rounding operations perfectly, the largest error introduced should be no more than 0.5D ULP. To our knowledge, this is the accuracy achieved by the addition and subtraction operations in all CUDA devices today.
In practice, some of the more complex arithmetic hardware units, such as division and transcendental functions, are typically implemented with polynomial approximation algorithms. If the hardware does not use a sufficient number of terms in the approximation, the result may have an error larger than 0.5D ULP. For example, if the ideal result of an inversion operation is 1.00B×2^1 but the hardware generates 1.10B×2^1 due to the use of an approximation algorithm, the error is 2D ULP, since the error (1.10B − 1.00B = 0.10B) is two times larger than the unit in the last place (0.01B). In practice, the hardware inversion operations in some early devices introduce an error that is twice the place value of the least place of the mantissa, or 2 ULP. Thanks to the more abundant transistors in more recent generations of CUDA devices, their hardware arithmetic operations are much more accurate.
Numerical algorithms often need to sum up a large number of values. For example, the dot product in matrix multiplication needs to sum up pairwise products of input matrix elements. Ideally, the order of summing these values should not affect the final total, since addition is an associative operation. However, with finite precision, the order of summing these values can affect the accuracy of the final result. Consider, for example, a sum reduction on four numbers in our 5-bit representation: 1.00B×2^0 + 1.00B×2^0 + 1.00B×2^−2 + 1.00B×2^−2.
If we add up the numbers in strict sequential order, we have the following sequence of operations:
((1.00B×2^0 + 1.00B×2^0) + 1.00B×2^−2) + 1.00B×2^−2
= (1.00B×2^1 + 1.00B×2^−2) + 1.00B×2^−2
= 1.00B×2^1 + 1.00B×2^−2
= 1.00B×2^1
Note that in the second and third steps, the smaller operand simply disappears because it is too small compared to the larger operand.
Now, let’s consider a parallel algorithm where the first two values are added and the second two values are added in parallel. The algorithm then adds up the pairwise sums:
(1.00B×2^0 + 1.00B×2^0) + (1.00B×2^−2 + 1.00B×2^−2)
= 1.00B×2^1 + 1.00B×2^−1
= 1.01B×2^1
Note that the results are different from the sequential result! This is because the sum of the third and fourth values is large enough that it now affects the addition result. This discrepancy between sequential algorithms and parallel algorithms often surprises application developers who are not familiar with floating-point precision and accuracy considerations. Although we showed a scenario where a parallel algorithm produced a more accurate result than a sequential algorithm, readers should be able to come up with a slightly different scenario where the parallel algorithm produces a less accurate result than a sequential algorithm. Experienced application developers either make sure that the variation in the final result can be tolerated, or ensure that the data is sorted or grouped in a way that the parallel algorithm results in the most accurate results.
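The same order dependence is easy to reproduce in standard IEEE single precision. In the C sketch below (our own demonstration), 2^−23 is half a ULP of 2.0f, so it is rounded away when added to the running sum one value at a time, but it survives when the two small values are combined with each other first, just as in the 5-bit example above:

#include <stdio.h>

int main(void)
{
    float big = 1.0f, small = 1.0f / 8388608.0f;        /* 2^-23 */

    float sequential = ((big + big) + small) + small;   /* small terms lost */
    float pairwise   = (big + big) + (small + small);   /* exact: 2 + 2^-22 */

    printf("sequential: %.8f\n", sequential);           /* 2.00000000 */
    printf("pairwise:   %.8f\n", pairwise);             /* 2.00000024 */
    return 0;
}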
A common technique to maximize floating-point arithmetic accuracy is to presort data before a reduction computation. In our sum reduction example, if we presort the data in ascending numerical order, we will have the following:
1.00B×2^−2 + 1.00B×2^−2 + 1.00B×2^0 + 1.00B×2^0
Adding the values sequentially in this order keeps every operand large enough relative to the running sum that nothing is dropped:
((1.00B×2^−2 + 1.00B×2^−2) + 1.00B×2^0) + 1.00B×2^0
= (1.00B×2^−1 + 1.00B×2^0) + 1.00B×2^0
= 1.10B×2^0 + 1.00B×2^0
= 1.01B×2^1
When we divide up the numbers into groups in a parallel algorithm, say the first pair in one group and the second pair in another group, numbers with numerical values close to each other are in the same group. Obviously, the sign of the numbers needs to be taken into account during the presorting process. Therefore, when we perform addition in these groups, we will likely have accurate results. Furthermore, some parallel algorithms use each thread to sequentially reduce values within each group. Having the numbers sorted in ascending order allows a sequential addition to get higher accuracy. This is a reason why sorting is frequently used in massively parallel numerical algorithms. Interested readers should study more advanced techniques such as compensated summation algorithm, also known as Kahan’s summation algorithm, for getting an even more robust approach to accurate summation of floating-point values [Kahan1965].
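For reference, a minimal sketch of Kahan’s compensated summation follows; the carry variable c re-injects the low-order bits that plain accumulation would round away (the compiler must not be allowed to reassociate these operations, e.g., avoid -ffast-math):

#include <stdio.h>

float kahan_sum(const float *x, int n)
{
    float sum = 0.0f, c = 0.0f;            /* c carries the lost low bits */
    for (int i = 0; i < n; ++i) {
        float y = x[i] - c;                /* re-inject the prior error */
        float t = sum + y;                 /* low bits of y may be lost here */
        c = (t - sum) - y;                 /* recover what was just lost */
        sum = t;
    }
    return sum;
}

int main(void)
{
    float x[4] = { 1.0f, 1.0f, 1.0f / 8388608.0f, 1.0f / 8388608.0f };
    printf("%.8f\n", kahan_sum(x, 4));     /* 2.00000024: small terms kept */
    return 0;
}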
While the order of operations may cause variation in the numerical outcome of reduction operations, it may have even more serious implications on some types of computation such as solvers for linear systems of equations. In these solvers, different numerical values of input may require different ordering of operations to find a solution. If an algorithm fails to follow a desired order of operations for an input, it may fail to find a solution even though the solution exists. Algorithms that can always find an appropriate operation order and thus find a solution to the problem as long as it exists for any given input values are called numerically stable. Algorithms that fall short are referred to as numerically unstable.
In some cases, numerical stability considerations can make it more difficult to find efficient parallel algorithms for a computational problem. We can illustrate this phenomenon with a solver that is based on Gaussian elimination. Consider the following system of linear equations:
(equation 1)
(equation 2)
(equation 3)
As long as the three planes represented by these equations have an intersection point, we can use Gaussian elimination to derive the solution that gives the coordinate of the intersection point. We show the process of applying Gaussian elimination to this system in Figure 7.9, where variables are systematically eliminated from lower positioned equations.
Figure 7.9 Gaussian elimination and backward substitution for solving systems of linear equations.
In the first step, all equations are divided by their coefficient for the X variable: 3 for equation 1, 2 for equation 2, and 1 for equation 3. This makes the coefficients for X in all equations the same. In step two, equation 1 is subtracted from equations 2 and 3. These subtractions eliminate variable X from equations 2 and 3, as shown in Figure 7.9.
We can now treat equations 2 and 3 as a smaller system of equations with one fewer variable than the original equation. Since they do not have variable X, they can be solved independently from equation 1. We can make more progress by eliminating variable Y from equation 3. This is done in step 3 by dividing equations 2 and 3 by the coefficients for their Y variables: −1/6 for equation 2 and 1/3 for equation 3. This makes the coefficients for Y in both equations 2 and 3 the same. In step four, equation 2 is subtracted from equation 3, which eliminates variable Y from equation 3.
For systems with a larger number of equations, the process would be repeated more times. However, since we have only three variables in this example, equation 3 is left with only the Z variable. We simply need to divide equation 3 by the coefficient for variable Z. This conveniently gives us the solution Z=3.
With the solution for the Z variable in hand, we can substitute the Z value into equation 2 to get the solution Y=2. We can then substitute both Z=3 and Y=2 into equation 1 to get the solution X=1. We now have the complete solution for the original system. It should be obvious why steps six and seven form the second phase of the method called backward substitution. We go backwards from the last equation to the first equation to get solutions for more and more variables.
In general, the equations are stored in matrix forms in computers. Since all calculations only involve the coefficients and the right-side values, we can just store these coefficients and right-side values in a matrix. Figure 7.10 shows the matrix view of the Gaussian elimination and back substitution process. Each row of the matrix corresponds to an original equation. Operations on equations become operations on matrix rows.
Figure 7.10 Gaussian elimination and backward substitution in matrix view.
After Gaussian elimination, the matrix becomes a triangular matrix. This is a very popular type of matrix for various physics and mathematics reasons. We see that the end goal is to make the coefficient part of the matrix into a diagonal form, where each row has only a value 1 on the diagonal line. This is called an identity matrix, because the result of multiplying any matrix by an identity matrix is that matrix itself. This is also the reason why performing Gaussian elimination on a matrix is equivalent to multiplying the matrix by its inverse matrix.
In general, it is straightforward to design a parallel algorithm for the Gaussian elimination procedure that we described in Figure 7.10. For example, we can write a CUDA kernel and designate each thread to perform all calculations to be done on a row of the matrix. For systems that can fit into shared memory, we can use a thread block to perform Gaussian elimination. All threads iterate through the steps. After each division step, all threads participate in barrier synchronization. They then all perform a subtraction step, after which one thread will stop its participation since its designated row has no more work to do until the back substitution phase. After the subtraction step, all threads need to perform barrier synchronization again to ensure that the next step will be done with the updated information. With systems of equations with many variables, we can expect a reasonable amount of speedup from the parallel execution.
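A hedged sketch of such a kernel is shown below: one thread per row, the augmented matrix in shared memory, barrier synchronization after the division and subtraction steps, and no pivoting (the instability this causes is discussed next). The kernel name and layout are our own; N is the number of equations, and each row stores N coefficients plus the right-hand-side value:

#define N 3

__global__ void gaussianEliminate(float *d_A)   /* d_A: N x (N+1), row-major */
{
    __shared__ float A[N][N + 1];
    int row = threadIdx.x;                 /* one thread owns one row */

    for (int j = 0; j <= N; ++j)           /* load this thread's row */
        A[row][j] = d_A[row * (N + 1) + j];
    __syncthreads();

    for (int i = 0; i < N; ++i) {
        if (row >= i) {                    /* division step */
            float lead = A[row][i];        /* assumed nonzero: no pivoting */
            for (int j = i; j <= N; ++j)
                A[row][j] /= lead;
        }
        __syncthreads();
        if (row > i) {                     /* subtraction step */
            for (int j = i; j <= N; ++j)
                A[row][j] -= A[i][j];
        }
        __syncthreads();                   /* all rows updated before next i */
    }

    for (int j = 0; j <= N; ++j)           /* write back the triangular form */
        d_A[row * (N + 1) + j] = A[row][j];
}

The kernel would be launched as gaussianEliminate<<<1, N>>>(d_A); the back substitution phase is omitted here.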
不幸的是,我们一直使用的简单高斯消除算法可能会受到数值不稳定的影响。这可以用下面的例子来说明。
Unfortunately, the simple Gaussian elimination algorithm we have been using can suffer from numerical instability. This can be illustrated with the following example.
(等式 1)
(equation 1)
(等式2)
(equation 2)
(等式 3)
(equation 3)
当我们执行算法的第一步时,我们会遇到一个问题。等式 1中X变量的系数为 0。我们无法将等式 1除以变量X 的系数,也无法通过从等式 2和3中减去等式 1来消除等式 2和3中的X变量。读者应验证该方程组可解并且具有相同的解X =1、Y =2 和Z =3。因此,该算法在数值上不稳定。即使解存在,它也可能无法为某些输入值生成解。
We will encounter a problem when we perform step one of the algorithm. The coefficient for the X variable in equation 1 is 0. We cannot divide equation 1 by the coefficient of variable X, and thus cannot eliminate the X variable from equations 2 and 3 by subtracting equation 1 from them. Readers should verify that this system of equations is solvable and has the same solution X=1, Y=2, and Z=3. Therefore, the algorithm is numerically unstable: it can fail to generate a solution for certain input values even though a solution exists.
这是高斯消除算法的一个众所周知的问题,可以通过通常称为旋转的方法来解决。这个想法是找到剩余方程之一,其中主导变量的系数不为0。通过将当前顶部方程与识别的方程交换,该算法可以成功地从其余方程中消除主导变量。如果我们将旋转应用于这三个方程,我们最终会得到以下集合。
This is a well-known problem with Gaussian elimination algorithms and can be addressed with a method commonly referred to as pivoting. The idea is to find one of the remaining equations of which the coefficient for the lead variable is not 0. By swapping the current top equation with the identified equation, the algorithm can successfully eliminate the lead variable from the rest of the equations. If we apply pivoting to the three equations, we end up with the following set.
(方程1′,原方程2)
(equation 1′, original equation 2)
(方程2′,原方程1)
(equation 2′, original equation 1)
(方程3′,原方程3)
(equation 3′, original equation 3)
请注意,方程 1′中X的系数不再为 0。我们可以继续进行高斯消元法,如图7.11所示。
Note that the coefficient for X in equation 1′ is no longer 0. We can proceed with Gaussian elimination, as illustrated in Figure 7.11.
图 7.11带旋转的高斯消除。
Figure 7.11 Gaussian elimination with pivoting.
读者应遵循图 7.11中的步骤。最重要的附加见解是,某些方程可能不具有算法在当前步骤中消除的变量(请参阅第 1 步中的第 2 行)图 7.11)。指定的线程不需要对方程进行除法。
Readers should follow the steps in Figure 7.11. The most important additional insight is that some equations may not have the variable that the algorithm is eliminating at the current step (see row 2 of step one in Figure 7.11). The designated thread does not need to do the division on the equation.
一般来说,旋转步骤应该选择所有主导变量中绝对系数值最大的方程,并将其方程(行)与当前顶部方程交换,以及将变量(列)与当前变量交换。虽然旋转在概念上很简单,但它可能会导致显着的实现复杂性和性能开销。在我们简单的 CUDA 内核实现中,回想一下每个线程都分配了一行。旋转需要检查并可能交换分布在这些线程上的系数数据。如果所有系数都在共享内存中,这不是一个大问题。只要我们控制扭曲内控制流发散的水平,我们就可以使用块中的线程运行并行缩减。
In general, the pivoting step should choose the equation with the largest absolute coefficient value among all the lead variables and swap its equation (row) with the current top equation, as well as swap the variable (column) with the current variable. While pivoting is conceptually simple, it can incur significant implementation complexity and performance overhead. In the case of our simple CUDA kernel implementation, recall that each thread is assigned a row. Pivoting requires an inspection and perhaps swapping of coefficient data spread across these threads. This is not a big problem if all coefficients are in the shared memory. We can run a parallel reduction using threads in the block as long as we control the level of control flow divergence within warps.
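As an illustration of that reduction (our sketch, not the book's code), the following device function finds the row with the largest absolute coefficient in the current pivot column. It assumes one thread per row, that blockDim.x is a power of two no larger than MAX_ROWS, and that the pivot column's coefficients have already been staged in memory reachable through col_coef; all names are illustrative.

#define MAX_ROWS 128

__device__ int find_pivot_row(const float *col_coef, int pivot, int n) {
  __shared__ float best_val[MAX_ROWS];
  __shared__ int   best_row[MAX_ROWS];
  int t = threadIdx.x;
  // Rows above the pivot have already been processed and are excluded from the search.
  best_val[t] = (t >= pivot && t < n) ? fabsf(col_coef[t]) : -1.0f;
  best_row[t] = t;
  __syncthreads();
  for (int stride = blockDim.x/2; stride > 0; stride /= 2) {
    if (t < stride && best_val[t + stride] > best_val[t]) {
      best_val[t] = best_val[t + stride];   // keep the larger |coefficient|
      best_row[t] = best_row[t + stride];   // and remember which row it came from
    }
    __syncthreads();
  }
  return best_row[0];                       // every thread reads the winning row index
}

The returned index would then drive the row swap (and, for full pivoting, the column swap) before the division step.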
然而,如果线性方程组由多个线程块甚至计算集群的多个节点求解,则检查分布在多个线程块或多个计算集群节点上的数据的想法可能是一个极其昂贵的提议。这是避免通信算法的主要动机,这些算法避免了数据的全局检查,例如数据透视[Ballard2011]。一般来说,有两种方法可以解决这个问题。部分枢转将交换操作的候选限制为来自一组局部方程,从而限制了全局检查的成本。然而,这可能会稍微降低解决方案的数值精度。研究人员还证明,随机化往往可以保持解决方案的高水平数值精度。
However, if the system of linear equations is being solved by multiple thread blocks or even multiple nodes of a compute cluster, the idea of inspecting data spread across multiple thread blocks or multiple compute cluster nodes can be an extremely expensive proposition. This is the main motivation for communication-avoiding algorithms that avoid a global inspection of data such as pivoting [Ballard2011]. In general, there are two approaches to this problem. Partial pivoting restricts the candidates of the swap operation to come from a localized set of equations so that the cost of global inspection is limited. This can, however, slightly reduce the numerical accuracy of the solution. Researchers have also demonstrated that randomization tends to maintain a high level of numerical accuracy for the solution.
本章介绍了浮点格式和可表示数字的概念,它们是理解精度的基础。基于这些概念,我们还解释了非规范化数以及它们在许多数值应用中的重要性。在早期的 CUDA 设备中,不支持非规范化数字。但是,后来的硬件世代支持非规范化数字。我们还解释了浮点运算的算术精度的概念。这对于 CUDA 程序员了解在特殊功能单元中实现的快速算术运算的潜在较低精度非常重要。更重要的是,读者现在应该很好地理解为什么并行算法常常会影响计算的准确性。计算结果以及如何使用排序和其他技术来提高计算的准确性。
This chapter introduced the concepts of floating-point format and representable numbers that are foundational to the understanding of precision. Based on these concepts, we also explained denormalized numbers and why they are important in many numerical applications. Early CUDA devices did not support denormalized numbers, but later hardware generations do. We have also explained the concept of arithmetic accuracy of floating-point operations. It is important for CUDA programmers to understand the potentially lower accuracy of fast arithmetic operations implemented in the special function units. More importantly, readers should now have a good understanding of why parallel algorithms can often affect the accuracy of calculation results, and how one can use sorting and other techniques to improve the accuracy of their computation.
7.1. 画出与图 7.5相同的6 位格式(1 位符号、3 位尾数、2 位指数)。使用您的结果来解释每个附加尾数位对数轴上可表示的数字集的作用。
7.1. Draw the equivalent of Figure 7.5 for a 6-bit format (1-bit sign, 3-bit mantissa, 2-bit exponent). Use your result to explain what each additional mantissa bit does to the set of representable numbers on the number line.
7.2. 为另一种 6 位格式(1 位符号、2 位尾数、3 位指数)绘制与图 7.5等效的图。使用您的结果来解释每个附加指数位对数轴上可表示的数字集的作用。
7.2. Draw the equivalent of Figure 7.5 for another 6-bit format (1-bit sign, 2-bit mantissa, 3-bit exponent). Use your result to explain what each additional exponent bit does to the set of representable numbers on the number line.
7.3. 假设在新的处理器设计中,由于技术难度,执行加法的浮点运算单元只能进行“舍入到零”(将值向0舍入)。硬件保持足够数量的位,唯一引入的错误是由于舍入引起的。此计算机上添加操作的最大ulp错误值是多少?
7.3. Assume that in a new processor design, due to technical difficulty, the floating-point arithmetic unit that performs addition can only do “round to zero” (rounding by truncating the value toward 0). The hardware maintains a sufficient number of bits that the only error introduced is due to rounding. What is the maximal ulp error value for add operations on this machine?
7.4. 一名研究生编写了一个 CUDA 内核,用于将大型浮点数组减少为其所有元素的总和。数组将始终按最小值到最大值排序。为了避免分支发散,他决定实现图6.4的算法。解释为什么这会降低他的结果的准确性。
7.4. A graduate student wrote a CUDA kernel to reduce a large floating-point array to the sum of all its elements. The array will always be sorted with the smallest values to the largest values. To avoid branch divergence, he decided to implement the algorithm of Figure 6.4. Explain why this can reduce the accuracy of his results.
7.5。 假设在算术单元设计中,硬件实现迭代近似算法,该算法在每个时钟周期为sin()函数生成结果的两个附加精确尾数位。架构师决定允许算术函数迭代九个时钟周期。假设硬件用零填充所有剩余的尾数位。对于 IEEE 单精度数,此设计中sin()函数的硬件实现的最大ulp误差是多少?假设省略的1.尾数位也必须由运算单元生成。
7.5. Assume that in an arithmetic unit design, the hardware implements an iterative approximation algorithm that generates two additional accurate mantissa bits of the result for the sin() function in each clock cycle. The architect decided to allow the arithmetic function to iterate nine clock cycles. Assume that the hardware fills in all remaining mantissa bits with zeroes. What would be the maximal ulp error of the hardware implementation of the sin() function in this design for IEEE single-precision numbers? Assume that the omitted "1." mantissa bit must also be generated by the arithmetic unit.
1. Ballard G、Demmel J、Holtz O、Schwartz O。《数值线性代数中的最小化通信》。SIAM J 矩阵分析应用程序。 2011;32(3):866–901。
1. Ballard G, Demmel J, Holtz O, Schwartz O. Minimizing communication in numerical linear algebra. SIAM J Matrix Analysis Applications. 2011;32(3):866–901.
2. IEEE 微处理器标准委员会。浮点运算标准草案 P754。 2008 年 1 月。
2. IEEE Microprocessor Standards Committee. Draft standard for floating-point arithmetic P754. January 2008.
3. Kahan W. 关于减少截断错误的进一步评论。ACM 的通讯。 1965;8(1):40。 doi 10.1145/363707.363723。
3. Kahan W. Further remarks on reducing truncation errors. Communications of the ACM. 1965;8(1):40. doi 10.1145/363707.363723.
8.1 背景
8.1 Background
8.2 一维并行卷积——基本算法
8.2 1D Parallel Convolution—A Basic Algorithm
8.3 恒定内存和缓存
8.3 Constant Memory and Caching
8.4 具有 Halo 元素的平铺一维卷积
8.4 Tiled 1D Convolution with Halo Elements
8.5 更简单的平铺一维卷积——通用缓存
8.5 A Simpler Tiled 1D Convolution—General Caching
8.6 概括
8.6 Summary
8.7 练习
8.7 Exercises
在接下来的几章中,我们将讨论一组重要的并行计算模式。这些模式是应用程序中出现的许多并行算法的基础。我们将从卷积开始,它是一种流行的数组运算,在信号处理、数字记录、图像处理、视频处理和计算机视觉中以多种形式使用。在这些应用领域中,卷积通常作为滤波器来执行,将信号和像素转换为更理想的值。例如,高斯滤波器是卷积滤波器,可用于锐化图像中对象的边界和边缘。其他滤波器可以平滑信号值,以便人们可以看到总体趋势。它们还构成了仿真模型中使用的大量力和能量计算算法的基础。卷积通常涉及对每个数据元素进行大量算术运算。对于高清图像、视频等大数据集,计算量可能会非常大。每个输出数据元素可以彼此独立地计算,这是大规模并行计算的理想特征。另一方面,输出数据之间存在大量的输入数据共享边界条件有些挑战性的元素。这使得卷积成为复杂的平铺方法和输入数据分级方法的重要用例。
In the next several chapters, we will discuss a set of important parallel computation patterns. These patterns are the basis of many parallel algorithms that appear in applications. We will start with convolution, which is a popular array operation used in various forms in signal processing, digital recording, image processing, video processing, and computer vision. In these application areas, convolution is often performed as a filter that transforms signals and pixels into more desirable values. For example, Gaussian filters are convolution filters that can be used to sharpen boundaries and edges of objects in images. Other filters smooth out the signal values so that one can see the big-picture trend. These filters also form the basis of a large number of force and energy calculation algorithms used in simulation models. Convolution typically involves a significant number of arithmetic operations on each data element. For large data sets such as high-definition images and videos, the amount of computation can be very large. Each output data element can be calculated independently of the others, a desirable trait for massively parallel computing. On the other hand, there is a substantial amount of input data sharing among output data elements, with somewhat challenging boundary conditions. This makes convolution an important use case for sophisticated tiling methods and input data staging methods.
从数学上讲,卷积是一种数组运算,其中每个输出数据元素是相邻输入元素集合的加权和。加权和计算中使用的权重由输入掩码数组定义,通常称为卷积核。由于 CUDA 内核函数和卷积内核之间存在不幸的名称冲突,因此我们将这些掩码数组称为卷积掩码以避免混淆。相同的卷积掩码通常用于数组的所有元素。
Mathematically, convolution is an array operation where each output data element is a weighted sum of a collection of neighboring input elements. The weights used in the weighted sum calculation are defined by an input mask array, commonly referred to as the convolution kernel. Since there is an unfortunate name conflict between the CUDA kernel functions and convolution kernels, we will refer to these mask arrays as convolution masks to avoid confusion. The same convolution mask is typically used for all elements of the array.
在音频数字信号处理中,输入数据为一维形式,并将信号量表示为时间的函数。图 8.1显示了 1D 数据的卷积示例,其中将五元素卷积掩码数组M应用于七元素输入数组N。我们将遵循 C 语言约定,其中N和P元素的索引从 0 到 6,M元素的索引从 0 到 4。事实上,我们使用五元素掩码M意味着每个P元素都是由加权和生成的对应的N 个元素,左侧最多两个元素,右侧最多两个元素。例如,P[2]的值被生成为N[0](N[2-2])到N[4](N[2+2])的加权和。在这个例子中,我们任意假设N个元素的值为1,2,3,…,7。M个元素定义了权重,在这个例子中其值为3,4,5,4,3 。每个权重值乘以对应的N 将乘积相加之前的元素值。如图8.1所示, P[2]的计算如下:
In audio digital signal processing, the input data are in 1D form and represent signal volume as a function of time. Figure 8.1 shows a convolution example for 1D data where a five-element convolution mask array M is applied to a seven-element input array N. We will follow the C language convention where N and P elements are indexed from 0 to 6 and M elements are indexed from 0 to 4. The fact that we use a five-element mask M means that each P element is generated as a weighted sum of the corresponding N element, up to two elements to its left, and up to two elements to its right. For example, the value of P[2] is generated as the weighted sum of N[0] (N[2-2]) through N[4] (N[2+2]). In this example, we arbitrarily assume that the values of the N elements are 1, 2, 3, …, 7. The M elements define the weights, whose values are 3, 4, 5, 4, and 3 in this example. Each weight value is multiplied by the corresponding N element value before the products are summed together. As shown in Figure 8.1, the calculation for P[2] is as follows:
图 8.1元素内部的一维卷积示例。
Figure 8.1 A 1D convolution example, inside elements.
P[2] = N[0]*M[0] + N[1]*M[1] + N[2]*M[2] + N[3]*M[3] + N[4]*M[4]
     = 1*3 + 2*4 + 3*5 + 4*4 + 5*3
     = 57
一般来说,掩码的大小往往是奇数,这使得加权和计算围绕正在计算的元素对称。也就是说,奇数个掩码元素定义要应用于输出元素的相应位置中的输入元素的权重,以及该位置每侧的相同数量的输入元素。在图 8.1中,掩码大小为五个元素,每个输出元素计算为相应输入元素、左侧两个元素和右侧两个元素的加权和。例如,P[2]计算为N[2]以及左侧的N[0]和N[1]以及右侧的N[3]和N[4]的加权和。
In general, the size of the mask tends to be an odd number, which makes the weighted sum calculation symmetric around the element being calculated. That is, an odd number of mask elements define the weight to be applied to the input element in the corresponding position of the output element along with the same number of input elements on each side of that position. In Figure 8.1, with a mask size of five elements, each output element is calculated as the weighted sum of the corresponding input element, two elements on the left and two elements on the right. For example, P[2] is calculated as the weighted sum of N[2] along with N[0] and N[1] on the left and N[3] and N[4] on the right.
在图 8.1中, P元素P[i]的计算可以看作是从N[i-2]开始的N子数组与M数组之间的内积。图 8.2显示了P[3]的计算。该计算与图 8.1的计算相比移动了N个元素。即P[3]的值是N[1] (N[3-2])到N[5] (3+2)的加权和。我们可以将P[3]的计算视为如下:
In Figure 8.1, the calculation for P element P[i] can be viewed as an inner product between the subarray of N that starts at N[i-2] and the M array. Figure 8.2 shows the calculation for P[3]. The calculation is shifted by one N element from that of Figure 8.1. That is, the value of P[3] is the weighted sum of N[1] (N[3-2]) through N[5] (N[3+2]). We can think of the calculation for P[3] as follows:
图8.2 1D卷积,P[3]的计算。
Figure 8.2 1D convolution, calculation of P[3].
P[3] = N[1]*M[0] + N[2]*M[1] + N[3]*M[2] + N[4]*M[3] + N[5]*M[4]
     = 2*3 + 3*4 + 4*5 + 5*4 + 6*3
     = 76
由于卷积是根据相邻元素定义的,因此靠近数组末端的输出元素自然存在边界条件。如图8.3所示,当我们计算P[1]时, N[1]左边只有一个N元素。也就是说,不存在根据我们的卷积定义,有足够的N个元素来计算P[1] 。处理此类边界条件的典型方法是为这些缺失的N个元素定义默认值。对于大多数应用程序,默认值是 0,这就是我们在图 8.3中使用的值。例如,在音频信号处理中,我们可以假设录音开始之前和结束之后信号音量为0。此时P[1]的计算如下:
Because convolution is defined in terms of neighboring elements, boundary conditions naturally exist for output elements that are close to the ends of an array. As shown in Figure 8.3, when we calculate P[1], there is only one N element to the left of N[1]. That is, there are not enough N elements to calculate P[1] according to our definition of convolution. A typical approach to handling such a boundary condition is to assign a default value to these missing N elements. For most applications, the default value is 0, which is what we use in Figure 8.3. For example, in audio signal processing, we can assume that the signal volume is 0 before the recording starts and after it ends. In this case, the calculation of P[1] is as follows:
图 8.3一维卷积边界条件。
Figure 8.3 1D convolution boundary condition.
P[1] = 0*M[0] + N[0]*M[1] + N[1]*M[2] + N[2]*M[3] + N[3]*M[4]
     = 0*3 + 1*4 + 2*5 + 3*4 + 4*3
     = 38
本次计算中不存在的N元素如图8.3中的虚线框所示。应该清楚的是,P[0]的计算将涉及两个缺失的N个元素,在本例中,这两个元素都被假设为0。我们将P[0]的计算留作练习。这些缺失的元素在文献中通常被称为幽灵元素。由于在并行计算中使用平铺,还存在其他类型的幻影元素。这些幽灵元素会对平铺的复杂性和/或效率产生重大影响。我们很快就会回到这一点。
The N element that does not exist in this calculation is illustrated as a dashed box in Figure 8.3. It should be clear that the calculation of P[0] will involve two missing N elements, both of which are assumed to be 0 in this example. We leave the calculation of P[0] as an exercise. These missing elements are typically referred to as ghost elements in the literature. There are also other types of ghost elements that arise from the use of tiling in parallel computation. These ghost elements can have a significant impact on the complexity and/or efficiency of tiling. We will come back to this point soon.
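For reference, a straightforward sequential host implementation of this 1D convolution, with out-of-range ghost elements treated as 0, could look like the following sketch; this is our illustration, and the function name is not from the text.

void convolution_1D_host(const float *N, const float *M, float *P,
                         int Mask_Width, int Width) {
  for (int i = 0; i < Width; i++) {
    float Pvalue = 0;
    int N_start_point = i - (Mask_Width/2);
    for (int j = 0; j < Mask_Width; j++) {
      if (N_start_point + j >= 0 && N_start_point + j < Width) {
        Pvalue += N[N_start_point + j] * M[j];   // ghost elements contribute 0 and are skipped
      }
    }
    P[i] = Pvalue;
  }
}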
对于图像处理和计算机视觉,输入数据通常是 2D 形式,像素位于xy空间中。图像卷积也是二维的,如图8.4所示。在 2D 卷积中,掩码M是一个 2D 数组。它的x和y维度确定了要包含在加权和计算中的邻居的范围。在图 8.4中,为了简单起见,我们使用 5×5 掩模。一般来说,掩码不必是方形阵列。为了生成输出元素,我们采用其中心位于输入数组N中相应位置的子数组。然后我们在掩码数组的元素和掩码数组的元素之间执行成对乘法。对于我们的示例,结果显示为图 8.4中N和P数组下方的 5×5 乘积数组。输出元素的值是乘积数组所有元素的总和。
For image processing and computer vision, input data is usually in 2D form, with pixels in an x-y space. Image convolutions are also two dimensional, as illustrated in Figure 8.4. In a 2D convolution, the mask M is a 2D array. Its x and y dimensions determine the range of neighbors to be included in the weighted sum calculation. In Figure 8.4, we use a 5×5 mask for simplicity. In general, the mask does not have to be a square array. To generate an output element, we take the subarray of N whose center is at the corresponding location in the input array. We then perform pairwise multiplication between the elements of the mask array and those of this subarray of N. For our example, the result is shown as the 5×5 product array below the N and P arrays in Figure 8.4. The value of the output element is the sum of all elements of the product array.
图 8.4 2D 卷积示例。
Figure 8.4 A 2D convolution example.
图 8.4中的示例显示了 P 2,2的计算。为了简洁起见,我们将在寻址 C 数组时使用 N y,x来表示N[y][x] 。由于N和P很可能是动态分配的数组,因此我们将在实际代码示例中使用线性化索引。用于计算P 2,2的值的N的子阵列在x或水平方向上从N 0,0到N 0,4跨度并且在y或垂直方向上从N 0,0到N 4,0跨度。计算如下:
The example in Figure 8.4 shows the calculation of P2,2. For brevity, we will use Ny,x to denote N[y][x] when addressing a C array. Since N and P are most likely dynamically allocated arrays, we will be using linearized indices in our actual code examples. The subarray of N for calculating the value of P2,2 spans from N0,0 to N0,4 in the x, or horizontal, direction and from N0,0 to N4,0 in the y, or vertical, direction. The calculation is as follows:
P2,2 = N0,0*M0,0 + N0,1*M0,1 + N0,2*M0,2 + N0,3*M0,3 + N0,4*M0,4
     + N1,0*M1,0 + N1,1*M1,1 + N1,2*M1,2 + N1,3*M1,3 + N1,4*M1,4
     + N2,0*M2,0 + N2,1*M2,1 + N2,2*M2,2 + N2,3*M2,3 + N2,4*M2,4
     + N3,0*M3,0 + N3,1*M3,1 + N3,2*M3,2 + N3,3*M3,3 + N3,4*M3,4
     + N4,0*M4,0 + N4,1*M4,1 + N4,2*M4,2 + N4,3*M4,3 + N4,4*M4,4
     = 1*1 + 2*2 + 3*3 + 4*2 + 5*1
     + 2*2 + 3*3 + 4*4 + 5*3 + 6*2
     + 3*3 + 4*4 + 5*5 + 6*4 + 7*3
     + 4*2 + 5*3 + 6*4 + 7*3 + 8*2
     + 5*1 + 6*2 + 7*3 + 8*2 + 5*1
     = 1 + 4 + 9 + 8 + 5
     + 4 + 9 + 16 + 15 + 12
     + 9 + 16 + 25 + 24 + 21
     + 8 + 15 + 24 + 21 + 16
     + 5 + 12 + 21 + 16 + 5
     = 321
与一维卷积一样,二维卷积也必须处理边界条件。对于x和y维度上的边界,存在更复杂的边界条件:输出元素的计算可能涉及沿水平边界、垂直边界或两者的边界条件。图 8.5说明了涉及两个边界的P元素的计算。从图 8.5中,P 1,0的计算涉及N 子数组中缺失的两列和缺失的水平行。与一维卷积一样,不同的应用程序为这些缺失的N个元素假设不同的默认值。在我们的示例中,我们假设默认值为 0。这些边界条件也会影响平铺的效率。我们很快就会回到这一点。
Like 1D convolution, 2D convolution must also deal with boundary conditions. With boundaries in both the x and y dimensions, there are more complex boundary conditions: the calculation of an output element may involve boundary conditions along a horizontal boundary, a vertical boundary, or both. Figure 8.5 illustrates the calculation of a P element that involves both boundaries. From Figure 8.5, the calculation of P1,0 involves two missing columns and one missing horizontal row in the subarray of N. Like in 1D convolution, different applications assume different default values for these missing N elements. In our example, we assume that the default value is 0. These boundary conditions also affect the efficiency of tiling. We will come back to this point soon.
图 8.5二维卷积边界条件。
Figure 8.5 A 2D convolution boundary condition.
正如我们在第 8.1 节中提到的,所有输出 ( P ) 元素的计算可以在卷积中并行完成。这使得卷积成为并行计算的理想问题。根据我们在矩阵-矩阵乘法方面的经验,我们可以快速编写一个简单的并行卷积核。为简单起见,我们将研究一维卷积。
As we mentioned in Section 8.1, the calculation of all output (P) elements can be done in parallel in a convolution. This makes convolution an ideal problem for parallel computing. Based on our experience in matrix–matrix multiplication, we can quickly write a simple parallel convolution kernel. For simplicity, we will work on 1D convolution.
第一步是定义内核的主要输入参数。我们假设一维卷积核接收五个参数:指向输入数组N 的指针、指向输入掩码M 的指针、指向输出数组P的指针、掩码Mask_Width的大小以及输入和输出数组Width的大小。因此,我们进行了以下设置:
The first step is to define the major input parameters for the kernel. We assume that the 1D convolution kernel receives five arguments: a pointer to the input array N, a pointer to the input mask M, a pointer to the output array P, the size of the mask Mask_Width, and the size of the input and output arrays Width. Thus, we have the following setup:
__global__ void convolution_1D_basic_kernel(float *N, float *M, float *P,
                                            int Mask_Width, int Width) {
  // kernel body
}
第二步是确定并实现线程到输出元素的映射。由于输出数组是一维的,因此一种简单而好的方法是将线程组织成一维网格,并让网格中的每个线程计算一个输出元素。读者应该认识到,就输出元素而言,这与向量加法示例的排列相同。因此,我们可以使用以下语句根据每个线程的块索引、块维度和线程索引来计算输出元素索引:
The second step is to determine and implement the mapping of threads to output elements. Since the output array is one dimensional, a simple and good approach is to organize the threads into a 1D grid and have each thread in the grid calculate one output element. Readers should recognize that this is the same arrangement as the vector addition example as far as output elements are concerned. Therefore, we can use the following statement to calculate an output element index from the block index, block dimension, and thread index for each thread:
int i = blockIdx.x*blockDim.x + threadIdx.x;
一旦确定了输出元素索引,我们就可以使用输出元素索引的偏移量来访问输入N个元素和掩码M 个元素。为了简单起见,我们假设Mask_Width是奇数,并且卷积是对称的,即Mask_Width是2*n+1,其中n是整数。P[i]的计算将使用N[in] , N[i-n+1] ,..., N[i-1] , N[i] , N[i+1] , ..., N[i+ n-1] , N[i+n]。我们可以使用一个简单的循环在内核中进行此计算:
Once we have determined the output element index, we can access the input N elements and the mask M elements using offsets from the output element index. For simplicity, we assume that Mask_Width is an odd number and the convolution is symmetric, that is, Mask_Width is 2*n+1 where n is an integer. The calculation of P[i] will use N[i-n], N[i-n+1], …, N[i-1], N[i], N[i+1], …, N[i+n-1], N[i+n]. We can use a simple loop to do this calculation in the kernel:
float Pvalue = 0;
int N_start_point = i - (Mask_Width/2);
for (int j = 0; j < Mask_Width; j++) {
  if (N_start_point + j >= 0 && N_start_point + j < Width) {
    Pvalue += N[N_start_point + j]*M[j];
  }
}
变量Pvalue将允许所有中间结果累积在寄存器中以节省 DRAM 带宽。for循环累积相邻元素对输出P元素的所有贡献。循环中的if语句测试所使用的输入N个元素中是否有任何一个是幽灵元素(位于N数组的左侧或右侧)。由于我们假设0值将用于幽灵元素,因此我们可以简单地跳过幽灵元素与其对应的N元素的乘法和累加。循环结束后,我们将Pvalue释放到输出P元素中。我们现在有一个简单的内核,如图8.6所示。
The variable Pvalue will allow all intermediate results to be accumulated in a register to save DRAM bandwidth. The for loop accumulates all the contributions from the neighboring elements to the output P element. The if statement in the loop tests if any of the input N elements used are ghost elements, either on the left side or the right side of the N array. Since we assume that 0 values will be used for ghost elements, we can simply skip the multiplication and accumulation of the ghost element and its corresponding N element. After the end of the loop, we release the Pvalue into the output P element. We now have a simple kernel in Figure 8.6.
图 8.6具有边界条件处理的一维卷积核。
Figure 8.6 A 1D convolution kernel with boundary condition handling.
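The listing of Figure 8.6 does not appear in the text above, but putting the fragments together gives a complete kernel along the following lines. The assembled listing is our reconstruction under the stated assumptions; in particular, it assumes Width is a multiple of the block size, since no bounds check on i appears in the fragments.

__global__ void convolution_1D_basic_kernel(float *N, float *M, float *P,
                                            int Mask_Width, int Width) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;   // one output element per thread
  float Pvalue = 0;
  int N_start_point = i - (Mask_Width/2);
  for (int j = 0; j < Mask_Width; j++) {
    if (N_start_point + j >= 0 && N_start_point + j < Width) {
      Pvalue += N[N_start_point + j]*M[j];       // ghost elements are simply skipped
    }
  }
  P[i] = Pvalue;
}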
我们可以对图 8.6中的内核进行两个观察。首先,会出现控制流发散。计算P数组左端或右端附近的输出P元素的线程将处理幽灵元素。正如我们在第 8.1 节中所示,每个相邻线程都会遇到不同数量的幽灵元素。因此,它们在if语句中的决定都会有所不同。计算P[0]的线程将跳过大约一半的乘法累加语句,而计算P[1]的线程将跳过一次乘法累加语句,依此类推。控制发散的成本将取决于Width、输入数组的大小和Mask_Width(掩码的大小)。对于大输入数组和小掩码,控制发散仅发生在输出元素的一小部分,这将使控制发散的影响很小。由于卷积通常应用于大型图像和空间数据,因此我们通常预计收敛的效果将是适度的或微不足道的。
We can make two observations about the kernel in Figure 8.6. First, there will be control flow divergence. The threads that calculate the output P elements near the left end or the right end of the P array will handle ghost elements. As we showed in Section 8.1, each of these neighboring threads will encounter a different number of ghost elements. Therefore, they will all make somewhat different decisions in the if statement. The thread that calculates P[0] will skip the multiply-accumulate statement about half of the time, whereas the one that calculates P[1] will skip it one fewer time, and so on. The cost of control divergence will depend on Width, the size of the input array, and Mask_Width, the size of the mask. For large input arrays and small masks, the control divergence occurs only for a small portion of the output elements, which keeps the effect of control divergence small. Since convolution is often applied to large images and spatial data, we typically expect the effect of control divergence to be modest or insignificant.
更严重的问题是内存带宽。内核中浮点算术计算与全局内存访问的比例仅为1.0左右。正如我们在矩阵-矩阵乘法示例中所看到的,这个简单的内核只能以峰值性能的一小部分运行。我们将在接下来的两节中讨论减少全局内存访问次数的两种关键技术。
A more serious problem is memory bandwidth. The ratio of floating-point arithmetic calculation to global memory accesses is only about 1.0 in the kernel. As we have seen in the matrix–matrix multiplication example, this simple kernel can only be expected to run at a small fraction of the peak performance. We will discuss two key techniques for reducing the number of global memory accesses in the next two sections.
我们可以对掩码数组M在卷积中的使用方式进行三个有趣的观察。首先, M数组的大小通常很小。大多数卷积掩码每个维度的元素少于 10 个。即使在 3D 卷积的情况下,掩模通常也仅包含不到 1,000 个元素。其次, M的内容在内核执行过程中不会改变。第三,所有线程都需要访问掩码元素。更好的是,所有线程都以相同的顺序访问M 个元素,从M[0]开始,并通过图 8.6中的for循环迭代一次移动一个元素。这两个属性使掩码数组成为常量内存和缓存的绝佳候选者。
We can make three interesting observations about the way the mask array M is used in convolution. First, the size of the M array is typically small. Most convolution masks have fewer than 10 elements in each dimension. Even in the case of a 3D convolution, the mask typically contains fewer than 1,000 elements. Second, the contents of M are not changed throughout the execution of the kernel. Third, all threads need to access the mask elements. Even better, all threads access the M elements in the same order, starting from M[0] and moving by one element at a time through the iterations of the for loop in Figure 8.6. These properties make the mask array an excellent candidate for constant memory and caching.
图 8.7 CUDA 内存模型回顾。
Figure 8.7 A review of the CUDA memory model.
CUDA 编程模型允许程序员在常量内存中声明变量。与全局内存变量一样,常量内存变量也对所有线程块可见。主要区别在于常量内存变量在内核执行期间不能被线程更改。此外,恒定存储器的大小可以根据设备的不同而变化。设备上可用的恒定内存量可以通过设备属性查询来了解。假设dev_prop由cudaGetDeviceProperties()返回。字段dev_prop.totalConstMem指示该字段中设备上可用的常量内存量。
The CUDA programming model allows programmers to declare a variable in the constant memory. Like global memory variables, constant memory variables are also visible to all thread blocks. The main difference is that a constant memory variable cannot be changed by threads during kernel execution. Furthermore, the size of the constant memory can vary from device to device. The amount of constant memory available on a device can be learned with a device property query. Assume that dev_prop is returned by cudaGetDeviceProperties(). The field dev_prop.totalConstMem gives the amount of constant memory available on the device.
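For example, a small host-side helper along the following lines (our sketch; querying device 0 is an illustrative assumption) could be used to check that the mask fits before placing it in constant memory:

int mask_fits_in_constant_memory(int Mask_Width) {
  cudaDeviceProp dev_prop;
  cudaGetDeviceProperties(&dev_prop, 0);               // query device 0
  return Mask_Width*sizeof(float) <= dev_prop.totalConstMem;
}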
要使用常量内存,主机代码需要以与全局内存变量不同的方式分配和复制常量内存变量。要在常量内存中声明M数组,主机代码将其声明为全局变量,如下所示:
To use constant memory, the host code needs to allocate and copy constant memory variables in a different way than global memory variables. To declare an M array in constant memory, the host code declares it as a global variable as follows:
#define MAX_MASK_WIDTH 10
__constant__ float M[MAX_MASK_WIDTH];
这是一个全局变量声明,应该位于源文件中的任何函数之外。关键字__constant__(每边两个下划线)告诉编译器数组M应放入设备常量内存中。
This is a global variable declaration and should be outside any function in the source file. The keyword __constant__ (two underscores on each side) tells the compiler that array M should be placed into the device constant memory.
假设主机代码已经使用Mask_Width元素在主机存储器中的掩码h_M数组中分配并初始化了掩码。h_M的内容可以传输到设备常量存储器中的M ,如下所示:
Assume that the host code has already allocated and initialized the mask in an array h_M in the host memory, with Mask_Width elements. The contents of h_M can be transferred to M in the device constant memory as follows:
cudaMemcpyToSymbol(M, h_M, Mask_Width*sizeof(float));
请注意,这是一个特殊的内存复制函数,它通知 CUDA 运行时复制到常量内存中的数据在内核执行期间不会更改。一般来说,cudaMemcpyToSymbol()函数的使用如下:
Note that this is a special memory copy function that informs the CUDA runtime that the data being copied into the constant memory will not be changed during kernel execution. In general, the use of the cudaMemcpyToSymbol() function is as follows:
cudaMemcpyToSymbol(dest, src, size)
其中dest是指向常量内存中目标位置的指针,src是指向主机内存中源数据的指针,size是要复制的字节数。
where dest is a pointer to the destination location in the constant memory, src is a pointer to the source data in the host memory, and size is the number of bytes to be copied.
内核函数将常量内存变量作为全局变量访问。因此,它们的指针不需要作为参数传递给内核。我们可以修改内核以使用常量内存,如图8.8所示。请注意,内核看起来与图 8.6中的内核几乎相同。唯一的区别是M不再通过作为参数传入的指针来访问。现在它作为主机代码声明的全局变量进行访问。请记住,全局变量的所有 C 语言作用域规则都适用于此处。如果主机代码和内核代码是在不同的文件中,内核代码文件必须包含相关的外部声明信息,以保证M的声明对内核可见。
Kernel functions access constant memory variables as global variables. Thus, their pointers do not need to be passed to the kernel as parameters. We can revise our kernel to use the constant memory as shown in Figure 8.8. Note that the kernel looks almost identical to that in Figure 8.6. The only difference is that M is no longer accessed through a pointer passed in as a parameter. It is now accessed as a global variable declared by the host code. Keep in mind that all the C language scoping rules for global variables apply here. If the host code and kernel code are in different files, the kernel code file must include the relevant external declaration information to ensure that the declaration of M is visible to the kernel.
图 8.8使用M常量内存的一维卷积核。
Figure 8.8 A 1D convolution kernel using constant memory for M.
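To make the overall workflow concrete, a minimal host-side sketch is shown below. This is our illustration, not the book's code; it assumes the revised kernel of Figure 8.8, which takes no M parameter, and that Width is a multiple of the 256-thread block size chosen here.

#define MAX_MASK_WIDTH 10
__constant__ float M[MAX_MASK_WIDTH];

void convolution_1D_on_device(float *h_N, float *h_M, float *h_P,
                              int Mask_Width, int Width) {
  float *d_N, *d_P;
  cudaMalloc((void**)&d_N, Width*sizeof(float));
  cudaMalloc((void**)&d_P, Width*sizeof(float));
  cudaMemcpy(d_N, h_N, Width*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpyToSymbol(M, h_M, Mask_Width*sizeof(float));   // mask goes to constant memory
  dim3 dimBlock(256, 1, 1);
  dim3 dimGrid(Width/dimBlock.x, 1, 1);
  convolution_1D_basic_kernel<<<dimGrid, dimBlock>>>(d_N, d_P, Mask_Width, Width);
  cudaMemcpy(h_P, d_P, Width*sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_N);
  cudaFree(d_P);
}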
与全局内存变量一样,常量内存变量也位于 DRAM 中。然而,由于 CUDA 运行时知道常量内存变量在内核执行期间不会被修改,因此它指示硬件在内核执行期间积极缓存常量内存变量。要了解恒定内存使用的好处,我们需要首先了解有关现代处理器内存和缓存层次结构的更多信息。
Like global memory variables, constant memory variables are also located in DRAM. However, because the CUDA runtime knows that constant memory variables are not modified during kernel execution, it directs the hardware to aggressively cache the constant memory variables during kernel execution. To understand the benefit of constant memory usage, we need to first understand more about modern processor memory and cache hierarchies.
在现代处理器中,从 DRAM 访问变量需要数百甚至数千个时钟周期。此外,从 DRAM 访问变量的速率通常远低于处理器执行算术运算的速率。 DRAM 的长延迟和有限带宽一直是几乎所有现代处理器(通常称为内存墙)的主要瓶颈。为了减轻内存瓶颈的影响,现代处理器通常采用片上高速缓冲存储器或高速缓存,以减少需要从 DRAM 访问的变量数量(图 8.9)。
In modern processors, accessing a variable from DRAM takes hundreds if not thousands of clock cycles. Also, the rate at which variables can be accessed from DRAM is typically much lower than the rate at which processors can perform arithmetic operations. The long latency and limited bandwidth of DRAM have been a major bottleneck in virtually all modern processors, a problem commonly referred to as the memory wall. To mitigate the effect of this memory bottleneck, modern processors commonly employ on-chip cache memories, or caches, to reduce the number of variables that need to be accessed from DRAM (Figure 8.9).
图 8.9现代处理器缓存层次结构的简化视图。
Figure 8.9 A simplified view of the cache hierarchy of modern processors.
与 CUDA 共享内存或一般的暂存器内存不同,缓存对程序是“透明的”。也就是说,要使用 CUDA 共享内存,程序需要将变量声明为__shared__并显式地将全局内存变量移动到共享内存变量中。另一方面,当使用缓存时,程序只需访问原始变量。处理器硬件会自动保留一些缓存中最近或最常用的变量并记住它们的原始 DRAM 地址。当稍后使用保留的变量之一时,硬件将从它们的地址检测到该变量的副本在缓存中可用。然后将从缓存中提供变量的值,从而无需访问 DRAM。
Unlike CUDA shared memory, or scratchpad memories in general, caches are “transparent” to programs. That is, to use CUDA shared memory, a program needs to declare variables as __shared__ and explicitly move a global memory variable into a shared memory variable. On the other hand, when using caches, the program simply accesses the original variables. The processor hardware will automatically retain some of the most recently or frequently used variables in the cache and remember their original DRAM address. When one of the retained variables is used later, the hardware will detect from their addresses that a copy of the variable is available in the cache. The value of the variable will then be provided from the cache, eliminating the need to access DRAM.
存储器的大小和存储器的速度之间存在权衡。因此,现代处理器通常采用多级缓存。这些高速缓存级别的编号约定反映了到处理器的距离。最低级别(L1 或级别 1)是直接连接到处理器核心的高速缓存。它的运行速度在延迟和带宽方面都非常接近处理器。然而,L1 缓存的大小较小,通常在 16 KB 到 64 KB 之间。 L2 缓存更大,范围为 128 KB 到 1 MB,但可能需要数十个周期才能访问。它们通常在 CUDA 设备中的多个处理器内核或流式多处理器 (SM) 之间共享。在当今的一些高端处理器中,甚至有大小可达数 MB 的 L3 缓存。
There is a trade-off between the size of a memory and the speed of a memory. As a result, modern processors often employ multiple levels of caches. The numbering convention for these cache levels reflects the distance to the processor. The lowest level, L1 or level 1, is the cache that is directly attached to a processor core. It runs at a speed very close to the processor in both latency and bandwidth. However, an L1 cache is small in size, typically between 16 KB and 64 KB. L2 caches are larger, in the range of 128 KB to 1 MB, but can take tens of cycles to access. They are typically shared among multiple processor cores, or streaming multiprocessors (SMs) in a CUDA device. In some high-end processors today, there are even L3 caches that can be of several MB in size.
在大规模并行处理器中使用缓存的一个主要设计问题是缓存一致性,当一个或多个处理器核心修改缓存数据时就会出现这种情况。由于一级缓存通常只直接连接到一个处理器内核,因此其他处理器内核不容易观察到其内容的变化。如果修改的变量在不同处理器内核上运行的线程之间共享,这会导致问题。需要缓存一致性机制来确保其他处理器内核的缓存内容得到更新。在大规模并行处理器中提供高速缓存一致性是困难且昂贵的。然而,它们的存在通常会简化并行软件开发。因此,现代 CPU 通常支持处理器内核之间的缓存一致性。虽然现代 GPU 提供两级缓存,但它们通常没有缓存一致性来最大化可用的硬件资源,从而提高处理器的算术吞吐量。
A major design issue with using caches in a massively parallel processor is cache coherence, which arises when one or more processor cores modify cached data. Since L1 caches are typically directly attached to only one of the processor cores, changes in its contents are not easily observed by other processor cores. This causes a problem if the modified variable is shared among threads running on different processor cores. A cache coherence mechanism is needed to ensure that the contents of the caches of the other processor cores are updated. Cache coherence is difficult and expensive to provide in massively parallel processors. However, their presence typically simplifies parallel software development. Therefore, modern CPUs typically support cache coherence among processor cores. While modern GPUs provide two levels of caches, they typically do without cache coherence to maximize hardware resources available to increase the arithmetic throughput of the processor.
常量内存变量在大规模并行处理器中使用缓存时发挥着有趣的作用。由于它们在内核执行期间不会改变,因此在内核执行期间不存在缓存一致性问题。因此,硬件可以积极地将常量变量值缓存在L1缓存中。此外,这些处理器中的缓存设计通常经过优化,可以将值广播到大量线程。因此,当 warp 中的所有线程访问相同的常量内存变量时(如M的情况),缓存可以提供大量带宽来满足线程的数据需求。此外,由于M的大小通常很小,因此我们可以假设所有M元素实际上总是从缓存访问。因此,我们可以简单地假设M次访问没有花费 DRAM 带宽。通过使用常量缓存,我们有效地将浮点运算与内存访问的比率增加了一倍,达到 2。
Constant memory variables play an interesting role in using caches in massively parallel processors. Since they are not changed during kernel execution, there is no cache coherence issue during the execution of a kernel. Therefore, the hardware can aggressively cache the constant variable values in L1 caches. Furthermore, the design of caches in these processors is typically optimized to broadcast a value to a large number of threads. As a result, when all threads in a warp access the same constant memory variable, as is the case with M, the caches can provide a tremendous amount of bandwidth to satisfy the data needs of threads. Also, since the size of M is typically small, we can assume that all M elements are effectively always accessed from caches. Therefore, we can simply assume that no DRAM bandwidth is spent on M accesses. With the use of constant caching, we have effectively doubled the ratio of floating-point arithmetic to memory access to 2.
对输入N数组元素的访问也可以受益于较新设备中的缓存。我们将在8.5 节中回到这一点。
The accesses to the input N array elements can also benefit from caching in more recent devices. We will come back to this point in Section 8.5.
我们现在解决使用平铺卷积算法访问N数组元素时的内存带宽问题。回想一下,在平铺算法中,线程协作将输入元素加载到片上存储器中,然后访问片上存储器以便随后使用这些元素。为了简单起见,我们将继续假设每个线程计算一个输出P元素。一个块中最多有 1,024 个线程,我们最多可以处理 1,024 个数据元素。我们将每个块处理的输出元素的集合称为输出图块。图 8.10显示了一个 16 元素一维卷积的小示例,该卷积使用四个线程块,每个线程块有四个线程。在此示例中,有四个输出图块。第一个输出图块覆盖P[0]到P[3],第二个输出图块覆盖P[4]到P[7],第三个图块P[8]到P[11],以及第四个图块P[12]到P[15]。请记住,我们每个块使用四个线程来保持示例较小。实际上,当前一代硬件的每个块应该至少有 32 个线程。从现在开始,我们假设M 个元素位于常量内存中。
We now address the memory bandwidth issue in accessing the N array element with a tiled convolution algorithm. Recall that in a tiled algorithm, threads collaborate to load input elements into an on-chip memory and then access the on-chip memory for their subsequent use of these elements. For simplicity, we will continue to assume that each thread calculates one output P element. With up to 1,024 threads in a block we can process up to 1,024 data elements. We will refer to the collection of output elements processed by each block as an output tile. Figure 8.10 shows a small example of a 16-element, 1D convolution using four thread blocks of four threads each. In this example, there are four output tiles. The first output tile covers P[0] through P[3], the second tile P[4] through P[7], the third tile P[8] through P[11], and the fourth tile P[12] through P[15]. Keep in mind that we use four threads per block to keep the example small. In practice, there should be at least 32 threads per block for the current generation of hardware. From this point on, we will assume that M elements are in the constant memory.
图 8.10一维平铺卷积示例。
Figure 8.10 A 1D tiled convolution example.
我们将讨论两种用于减少全局内存访问总数的输入数据平铺策略。第一个是最直观的,涉及将计算线程块的所有输出元素所需的所有输入数据元素加载到共享内存中。要加载的输入元素的数量取决于掩码的大小。为了简单起见,我们将继续假设掩码大小是等于 2× n +1的奇数。也就是说,每个输出元素P[i]是对应输入元素N[i]处的输入元素、左边的n 个输入元素 ( N[in] , … , N[i-1] )的加权和,以及右侧的n 个输入元素 ( N[i+1], …, N[i+n] )。图 8.10显示了n =2的示例。
We will discuss two input data tiling strategies for reducing the total number of global memory accesses. The first one is the most intuitive and involves loading all input data elements needed for calculating all output elements of a thread block into the shared memory. The number of input elements to be loaded depends on the size of the mask. For simplicity, we will continue to assume that the mask size is an odd number equal to 2×n+1. That is, each output element P[i] is a weighted sum of the input element at the corresponding input element N[i], the n input elements to the left (N[i-n], …, N[i-1]), and the n input elements to the right (N[i+1], …, N[i+n]). Figure 8.10 shows an example where n=2.
块 0 中的线程计算输出元素P[0]到P[3]。这是输出数据中最左边的图块,通常称为左边界图块。它们共同需要输入元素N[0]到N[5]。请注意,计算还需要N[0]左侧的两个幽灵元素。这在图 8.6的图块 0 的左端显示为两个虚线空元素。这些幻影元素将被假定具有默认值 0。图块 3 在输入数组N的右端具有类似的情况。在我们的讨论中,我们将像图块 0 和图块 3 这样的图块称为边界图块因为它们涉及输入数组N边界处或边界之外的元素。
Threads in block 0 calculate output elements P[0] through P[3]. This is the leftmost tile in the output data and is often referred to as the left boundary tile. They collectively require input elements N[0] through N[5]. Note that the calculation also requires two ghost elements to the left of N[0]. This is shown as two dashed empty elements on the left end of tile 0 of Figure 8.10. These ghost elements are assumed to have a default value of 0. Tile 3 has a similar situation at the right end of input array N. In our discussions, we will refer to tiles like tile 0 and tile 3 as boundary tiles since they involve elements at or outside the boundary of the input array N.
块 1 中的线程计算输出元素P[4]到P[7]。它们共同需要输入元素N[2]到N[9] ,也如图8.10所示。图 8.10中图块 1 和图块 2 的计算不涉及幻影元素,通常称为内部图块。请注意,元素N[2]和N[3]属于两个图块,并且被加载到共享内存中两次,一次加载到块 0 的共享内存,一次加载到块 1 的共享内存。一个块仅对该块的线程可见,这些元素需要加载到各自的共享内存中,以便所有涉及的线程访问它们。涉及多个图块并由多个块加载的元素通常称为光环元素或裙边元素,因为它们“悬挂”在仅由单个块使用的部分的一侧。我们将指的是输入图块的中心部分,该部分仅由该输入图块的内部元素的单个块使用。图块 1 和图块 2 通常被称为内部图块,因为它们不涉及输入数组N的边界处或外部的任何幻影元素。
Threads in block 1 calculate output elements P[4] through P[7]. They collectively require input elements N[2] through N[9], also shown in Figure 8.10. Note that elements N[2] and N[3] belong to two tiles and are loaded into the shared memory twice, once into the shared memory of block 0 and once into the shared memory of block 1. Since the contents of the shared memory of a block are visible only to the threads of that block, these elements need to be loaded into the respective shared memories for all involved threads to access them. The elements that are involved in multiple tiles and loaded by multiple blocks are commonly referred to as halo elements or skirt elements, since they "hang" from the side of the part that is used solely by a single block. We will refer to the center part of an input tile, the part used solely by a single block, as the internal elements of that input tile. Tiles 1 and 2 are commonly referred to as internal tiles since they do not involve any ghost elements at or outside the boundaries of the input array N.
我们现在展示将输入图块加载到共享内存中的内核代码。我们首先声明一个共享内存数组N_ds来保存每个块的N 个图块。共享存储器阵列的大小必须足够大以容纳输入图块的左光环元素、中心元素和右光环元素。我们假设Mask_Size是奇数。总计为TILE_SIZE + MAX_MASK_WIDTH -1,在内核中的以下声明中使用:
We now show the kernel code that loads the input tile into shared memory. We first declare a shared memory array, N_ds, to hold the N tile for each block. The size of the shared memory array must be large enough to hold the left halo elements, the center elements, and the right halo elements of an input tile. We assume that Mask_Width is an odd number. The total is TILE_SIZE + MAX_MASK_WIDTH - 1, which is used in the following declaration in the kernel:
__shared__ float N_ds[TILE_SIZE + MAX_MASK_WIDTH - 1];
然后我们加载左光环元素,其中包括前一个图块的最后n = Mask_Width/2中心元素。例如,在图 8.10中,图块 1 的左光环元素由图块 0 的最后两个中心元素组成。在 C 中,假设Mask_Width是奇数,则表达式Mask_Width/2将产生一个整数值,即与(Mask_Width-1)/2相同。我们将使用块的最后(Mask_Width/2)线程来加载左光环元素。这是通过以下两个语句完成的:
We then load the left halo elements, which are the last n = Mask_Width/2 center elements of the previous tile. For example, in Figure 8.10, the left halo elements of tile 1 consist of the last two center elements of tile 0. In C, assuming that Mask_Width is an odd number, the expression Mask_Width/2 will result in an integer value that is the same as (Mask_Width-1)/2. We will use the last (Mask_Width/2) threads of the block to load the left halo elements. This is done with the following two statements:
int halo_index_left = (blockIdx.x - 1)*blockDim.x + threadIdx.x;
if (threadIdx.x >= blockDim.x - n) {
  N_ds[threadIdx.x - (blockDim.x - n)] =
    (halo_index_left < 0) ? 0 : N[halo_index_left];
}
在第一个语句中,我们使用表达式(blockIdx.x-1)*blockDim.x+threadIdx.x将线程索引映射到前一个图块中的元素索引。然后,我们使用if语句中的条件仅选择最后n 个线程来加载所需的左光环元素。例如,在图8.6中,blockDim.x等于4,n等于2;仅使用线程 2 和 3。由于失败情况,线程 0 和 1 将不会加载任何内容。
In the first statement, we map the thread index to an element index into the previous tile with the expression (blockIdx.x-1)*blockDim.x+threadIdx.x. We then pick only the last n threads to load the needed left halo elements using the condition in the if statement. For example, in Figure 8.10, blockDim.x equals 4 and n equals 2; only threads 2 and 3 are used. Threads 0 and 1 will not load anything because the condition fails for them.
对于所使用的线程,我们还需要检查它们的光环元素是否是幽灵元素。这可以通过测试计算的halo_index_left值是否为负来检查。如果是这样,则晕元素实际上是鬼元素,因为它们的N索引为负,超出了N索引的有效范围。在这种情况下,条件 C 赋值将为线程选择 0。否则,条件语句将使用 halo_index_left将适当的N个元素加载到共享内存中。共享内存索引计算是这样的,左边的 halo 元素将从元素 0 开始加载到共享内存数组中。例如,在图 8.10中,blockDim.xn等于 2。因此,对于块 1,线程 2 将加载最左边的 halo 元素到N_ds[0]中,线程 3 会将下一个 halo 元素加载到N_ds[1]中。然而,对于块 0,线程 2 和 3 都会将值 0 加载到N_ds[0]和N_ds[1]中。
For the threads used, we also need to check if their halo elements are ghost elements. This can be checked by testing if the calculated halo_index_left value is negative. If so, the halo elements are actually ghost elements since their N indices are negative, outside the valid range of the N indices. The conditional C assignment will choose 0 for threads in this situation. Otherwise, the conditional statement will use the halo_index_left to load the appropriate N elements into the shared memory. The shared memory index calculation is such that left halo elements will be loaded into the shared memory array starting at element 0. For example, in Figure 8.10, blockDim.x-n equals 2. So for block 1, thread 2 will load the leftmost halo element into N_ds[0] and thread 3 will load the next halo element into N_ds[1]. However, for block 0, both threads 2 and 3 will load value 0 into N_ds[0] and N_ds[1].
下一步是加载输入图块的中心元素。这是通过将blockIdx.x和threadIdx.x值映射到适当的N索引来完成的,如以下语句所示。读者应该熟悉所使用的N索引表达式:
The next step is to load the center elements of the input tile. This is done by mapping the blockIdx.x and threadIdx.x values into the appropriate N indices, as shown in the following statement. Readers should be familiar with the N index expression used:
N_ds[n + threadIdx.x] = N[blockIdx.x*blockDim.x + threadIdx.x];
由于N_ds数组的前n 个元素已经包含左晕元素,因此需要将中心元素加载到N_ds的下一部分中。这是通过将n添加到threadIdx.x作为每个线程的索引来将其加载的中心元素写入N_ds来完成的。
Since the first n elements of the N_ds array already contain the left halo elements, the center elements need to be loaded into the next section of N_ds. This is done by adding n to threadIdx.x as the index for each thread to write its loaded center element into N_ds.
现在我们加载右侧光环元素,这与加载左侧光环非常相似。我们首先将blockIdx.x和threadIdx.x映射到下一个输出图块的元素。这是通过将(blockIdx.x+1)*blockDim.x添加到线程索引以形成右侧光环元素的N索引来完成的。在本例中,我们正在加载开始的Mask_Width:
We now load the right halo elements, which is quite similar to loading the left halo. We first map the blockIdx.x and threadIdx.x values to the elements of the next output tile. This is done by adding (blockIdx.x+1)*blockDim.x to the thread index to form the N index for the right halo elements. In this case, we are loading the beginning n = Mask_Width/2 elements of the next tile:
int halo_index_right = (blockIdx.x + 1)*blockDim.x + threadIdx.x;
if (threadIdx.x < n) {
  N_ds[n + blockDim.x + threadIdx.x] =
    (halo_index_right >= Width) ? 0 : N[halo_index_right];
}
现在所有输入图块元素都在N_ds中,每个线程可以使用N_ds元素计算其输出P元素值。每个线程将使用N_ds的不同部分。线程 0 将使用N_ds[0]到N_ds[Mask_Width-1];线程 1 将使用N_ds[1]到N[Mask_Width]。一般来说,每个线程都会使用N_ds[threadIdx.x]到N[threadIdx.x+Mask_Width-1]。这是在以下for循环中实现的,用于计算分配给线程的P元素:
Now that all the input tile elements are in N_ds, each thread can calculate its output P element value using the N_ds elements. Each thread will use a different section of N_ds. Thread 0 will use N_ds[0] through N_ds[Mask_Width-1]; thread 1 will use N_ds[1] through N_ds[Mask_Width]. In general, each thread will use N_ds[threadIdx.x] through N_ds[threadIdx.x+Mask_Width-1]. This is implemented in the following for loop to calculate the P element assigned to the thread:
float Pvalue = 0;
for (int j = 0; j < Mask_Width; j++) {
  Pvalue += N_ds[threadIdx.x + j]*M[j];
}
P[i] = Pvalue;
然而,一定不要忘记使用syncthreads()进行屏障同步,以确保同一块中的所有线程在任何人开始从共享内存中使用它们之前都已完成加载分配的N 个元素。
However, one must not forget to do a barrier synchronization using __syncthreads() to make sure that all threads in the same block have completed loading their assigned N elements before any of them start using the elements from the shared memory.
请注意,乘法和累加的代码比基本算法更简单。用于加载左右光环元素的条件语句已将 0 值放入第一个和最后一个线程块的相应N_ds元素中。
Note that the code for multiply and accumulate is simpler than the base algorithm. The conditional statements for loading the left and right halo elements have placed the 0 values into the appropriate N_ds elements for the first and last thread block.
平铺一维卷积核比基本核明显更长、更复杂。我们引入了额外的复杂性来减少N 个元素的 DRAM 访问数量。目标是提高算术与内存访问的比率,以便实现的性能不受 DRAM 带宽的限制或较少限制。我们将通过比较图 8.8和图 8.11中内核的每个线程块执行的 DRAM 访问次数来评估改进情况。
The tiled 1D convolution kernel is significantly longer and more complex than the basic kernel. We introduced the additional complexity to reduce the number of DRAM accesses for the N elements. The goal is to improve the arithmetic to memory access ratio so that the achieved performance is not limited or less limited by the DRAM bandwidth. We will evaluate improvement by comparing the number of DRAM accesses performed by each thread block for the kernels in Figure 8.8 and Figure 8.11.
图 8.11使用M的常量内存的平铺一维卷积核。
Figure 8.11 A tiled 1D convolution kernel using constant memory for M.
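The listing of Figure 8.11 does not appear in the text above; stitching the fragments together gives a kernel along the following lines. The assembled listing is our reconstruction under the stated assumptions: TILE_SIZE and MAX_MASK_WIDTH are defined as before, the block size equals TILE_SIZE, and Width is a multiple of the block size (the fragments include no bounds check on i).

__global__ void convolution_1D_tiled_kernel(float *N, float *P,
                                            int Mask_Width, int Width) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  __shared__ float N_ds[TILE_SIZE + MAX_MASK_WIDTH - 1];
  int n = Mask_Width/2;

  // left halo elements (or 0 for ghost elements), loaded by the last n threads
  int halo_index_left = (blockIdx.x - 1)*blockDim.x + threadIdx.x;
  if (threadIdx.x >= blockDim.x - n) {
    N_ds[threadIdx.x - (blockDim.x - n)] =
      (halo_index_left < 0) ? 0 : N[halo_index_left];
  }

  // center elements of the input tile
  N_ds[n + threadIdx.x] = N[blockIdx.x*blockDim.x + threadIdx.x];

  // right halo elements (or 0 for ghost elements), loaded by the first n threads
  int halo_index_right = (blockIdx.x + 1)*blockDim.x + threadIdx.x;
  if (threadIdx.x < n) {
    N_ds[n + blockDim.x + threadIdx.x] =
      (halo_index_right >= Width) ? 0 : N[halo_index_right];
  }
  __syncthreads();

  float Pvalue = 0;
  for (int j = 0; j < Mask_Width; j++) {
    Pvalue += N_ds[threadIdx.x + j]*M[j];
  }
  P[i] = Pvalue;
}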
在图8.8中,有两种情况。对于不处理幽灵元素的线程块,每个线程访问的N 个元素的数量为Mask_Width。因此,每个线程块访问的N 个元素的总数为blockDim.x*Mask_Width或blockDim.x*(2n+1)。例如,如果Mask_Width等于 5,并且每个块包含 1,024 个线程,则每个块总共访问 5,120 N 个元素。
In Figure 8.8, there are two cases. For thread blocks that do not handle ghost elements, the number of N elements accessed by each thread is Mask_Width. Thus, the total number of N elements accessed by each thread block is blockDim.x∗Mask_Width or blockDim.x∗(2n+1). For example, if Mask_Width is equal to 5 and each block contains 1,024 threads, each block accesses a total of 5,120 N elements.
对于第一个和最后一个块,即处理幻影元素的线程,不会对幻影元素进行内存访问。这减少了存储器访问的次数。我们可以通过枚举使用每个幽灵元素的线程数来计算减少的内存访问次数。图 8.12中的一个小例子对此进行了说明。最左边的幽灵元素由一个线程使用。左第二个幽灵元素由两个线程使用。一般来说,鬼元素的个数为n,使用每个幽灵元素的线程数,从左到右是 1, 2, …, n。这是一个简单的序列,总和为n ( n + 1)/2,它是由于幽灵元素而避免的访问总数。对于我们的简单示例,其中Mask_Width等于 5 并且n等于 2,由于幽灵元素而避免的访问次数为 2×3/2=3。对于正确的鬼元素,类似的分析给出了相同的结果。应该清楚的是,对于大线程块,鬼元素对小掩码尺寸的影响将是微不足道的。
For the first and last blocks, where threads handle ghost elements, no memory accesses are made for the ghost elements. This reduces the number of memory accesses. We can calculate the reduction by enumerating the number of threads that use each ghost element. This is illustrated with a small example in Figure 8.12. The leftmost ghost element is used by one thread. The second ghost element from the left is used by two threads. In general, the number of ghost elements is n, and the number of threads that use each of these ghost elements, from left to right, is 1, 2, …, n. This is a simple series with sum n(n+1)/2, which is the total number of accesses that are avoided due to ghost elements. For our simple example where Mask_Width is equal to 5 and n is equal to 2, the number of accesses avoided due to ghost elements is 2×3/2=3. A similar analysis gives the same result for the right ghost elements. It should be clear that for large thread blocks, the effect of ghost elements for small mask sizes will be insignificant.
图 8.12访问N元素和幽灵元素的小例子。
Figure 8.12 A small example of accessing N elements and ghost elements.
现在我们计算图 8.11中平铺内核对N个元素的内存访问总数。所有内存访问都已转移到将N个元素加载到共享内存中的代码。在平铺内核中,每N个元素仅由一个线程加载。但是,对于不处理幽灵元素的块,也将加载2 n 个光环元素, n 个从左侧,n 个从右侧。因此,我们将blockDim.x+2n元素用于内部线程块,将blockDim+n 元素用于边界线程块。
We now calculate the total number of memory accesses for N elements by the tiled kernel in Figure 8.11. All the memory accesses have been shifted to the code that loads the N elements into the shared memory. In the tiled kernel, each N element is loaded by only one thread. However, 2n halo elements will also be loaded, n from the left and n from the right, for blocks that do not handle ghost elements. Therefore, blockDim.x+2n elements are loaded by the internal thread blocks and blockDim.x+n elements by the boundary thread blocks.
对于内部线程块,基本和平铺一维卷积核之间的内存访问比率为
For internal thread blocks, the ratio of memory accesses between the basic and the tiled 1D convolution kernel is
(blockDim.x*(2n+1)) / (blockDim.x+2n)
而边界块的比率是
whereas the ratio for boundary blocks is
(blockDim.x*(2n+1) − n(n+1)/2) / (blockDim.x+n)
对于大多数情况,blockDim.x比n大得多。两个比率都可以通过消除小项n ( n + 1)/2 和n来近似:
For most situations, blockDim.x is much larger than n. Both ratios can be approximated by eliminating the small terms n(n + 1)/2 and n:
blockDim.x*(2n+1) / blockDim.x = 2n+1 = Mask_Width
这应该是一个非常直观的结果。在原始算法中,每个N个元素由大约Mask_Width线程冗余加载。例如,在图 8.12中,N[2]由计算P[2]、P[3]、P[4]、P[5]和P[6]的五个线程加载。也就是说,存储器访问减少的比率大约与掩码大小成正比。
This should be quite an intuitive result. In the original algorithm, each N element is redundantly loaded by approximately Mask_Width threads. For example, in Figure 8.12, N[2] is loaded by the five threads that calculate P[2], P[3], P[4], P[5], and P[6]. That is, the ratio of memory access reduction is approximately proportional to the mask size.
然而,在实践中,较小项的影响可能是显着的并且不能被忽视。例如,如果blockDim.x为 128,n为 5,则内部块的比率为
However, in practice, the effect of the smaller terms may be significant and cannot be ignored. For example, if blockDim.x is 128 and n is 5, the ratio for the internal blocks is
(128*11 − 10) / (128 + 10) = 1398 / 138 = 10.13
而近似比率为 11。应该清楚的是,随着blockDim.x变小,比率也会变小。例如,如果blockDim为 32 并且n为 5,则内部块的比率变为
whereas the approximate ratio would be 11. It should be clear that as blockDim.x becomes smaller, the ratio also becomes smaller. For example, if blockDim is 32 and n is 5, the ratio for the internal blocks becomes
(32*11 − 10) / (32 + 10) = 8.14
读者在使用较小的块和图块尺寸时应始终小心。它们可能导致内存访问的减少明显少于预期。在实践中,由于以下原因,经常使用较小的瓷砖尺寸片上内存量不足,特别是对于 2D 和 3D 卷积,所需的片上内存量随着图块的尺寸而快速增长。
Readers should always be careful when using smaller block and tile sizes. They may result in significantly less reduction in memory accesses than expected. In practice, smaller tile sizes are often used due to an insufficient amount of on-chip memory, especially for 2D and 3D convolution where the amount of on-chip memory needed grows quickly with the dimension of the tile.
在图 8.11中,代码的大部分复杂性都与将左右光环元素以及内部元素加载到共享内存有关。较新的 GPU(例如 Fermi)提供通用 L1 和 L2 缓存,其中 L1 是每个 SM 专用的,而 L2 在所有 SM 之间共享。这使得块有机会利用其光环元素可能在 L2 高速缓存中可用的事实。
In Figure 8.11, much of the complexity of the code has to do with loading the left and right halo elements in addition to the internal elements into the shared memory. More recent GPUs such as Fermi provide general L1 and L2 caches, where L1 is private to each SM and L2 is shared among all SMs. This leads to an opportunity for the blocks to take advantage of the fact that their halo elements may be available in the L2 cache.
回想一下,块的光环元素也是相邻块的内部元素。例如,在图 8.10中,图块 1 的光环元素N[2]和N[3]也是图块 0 的内部元素。当块 1 需要使用这些光环元素时,它们很有可能是由于块 0 的访问,已经在 L2 高速缓存中。因此,对这些光环元素的内存访问可以自然地从 L2 高速缓存提供服务,而不会导致额外的 DRAM 流量。也就是说,我们可以将对这些 halo 元素的访问保留在原始N元素中,而不是将它们加载到N_ds中。我们现在提出一种更简单的平铺一维卷积算法,该算法仅将每个平铺的内部元素加载到共享内存中。
Recall that the halo elements of a block are also internal elements of a neighboring block. For example, in Figure 8.10, the halo elements N[2] and N[3] of tile 1 are also internal elements of tile 0. There is a significant probability that by the time block 1 needs to use these halo elements, they are already in the L2 cache due to the accesses by block 0. As a result, the memory accesses to these halo elements may be naturally served from the L2 cache without causing additional DRAM traffic. That is, we can leave the accesses to these halo elements in the original N elements rather than loading them into the N_ds. We now present a simpler tiled 1D convolution algorithm that only loads the internal elements of each tile into the shared memory.
在更简单的分片内核中,共享内存N_ds数组只需要保存分片的内部元素。因此,它是用TILE_SIZE声明的,而不是TILE_SIZE+Mask_Width-1:
In the simpler tiled kernel, the shared memory array N_ds only needs to hold the internal elements of the tile. Thus, it is declared with size TILE_SIZE rather than TILE_SIZE + MAX_MASK_WIDTH - 1:
__shared__ float N_ds[TILE_SIZE];
i = blockIdx.x*blockDim.x + threadIdx.x;
只需一行代码即可加载图块变得非常简单:
Loading the tile becomes very simple with only one line of code:
N_ds[threadIdx.x] = N[i];
在使用N_ds中的元素之前,我们仍然需要屏障同步。然而,计算P元素的循环变得更加复杂。它需要添加条件来检查光环元素和幽灵元素的使用。幽灵元素的处理与图 8.6中的条件语句相同。乘法累加语句变得更加复杂:
We still need a barrier synchronization before using the elements in N_ds. The loop that calculates P elements, however, becomes more complex. It needs to add conditions to check for use of both halo elements and ghost elements. The ghost elements are handled with the same conditional statement as that in Figure 8.6. The multiply–accumulate statement becomes more complex:
__syncthreads();
int This_tile_start_point = blockIdx.x * blockDim.x;
int Next_tile_start_point = (blockIdx.x + 1) * blockDim.x;
int N_start_point = i - (Mask_Width/2);
float Pvalue = 0;
for (int j = 0; j < Mask_Width; j++) {
   int N_index = N_start_point + j;
   if (N_index >= 0 && N_index < Width) {
      if ((N_index >= This_tile_start_point)
          && (N_index < Next_tile_start_point)) {
         Pvalue += N_ds[threadIdx.x+j-(Mask_Width/2)]*M[j];
      } else {
         Pvalue += N[N_index] * M[j];
      }
   }
}
P[i] = Pvalue;
The variables This_tile_start_point and Next_tile_start_point hold the starting position index of the tile processed by the current block and that of the tile processed by the next block, respectively. For example, in Figure 8.10, the value of This_tile_start_point for block 1 is 4 and the value of Next_tile_start_point is 8.
The new if statement tests whether the current access to the N element falls within the tile by testing it against This_tile_start_point and Next_tile_start_point. If the element falls within the tile—that is, it is an internal element for the current block—it is accessed from the N_ds array in the shared memory. Otherwise, it is accessed from the N array, which is hopefully in the L2 cache. The final kernel code is shown in Figure 8.13.
Figure 8.13 A simpler tiled 1D convolution kernel using constant memory and general caching.
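Since the listing of Figure 8.13 is not reproduced here, the following sketch assembles the fragments discussed above into one kernel. It is an illustrative reconstruction rather than the book's exact listing: the kernel name is made up, the mask M is assumed to already reside in constant memory (declared by the host, e.g., __constant__ float M[MAX_MASK_WIDTH];), and TILE_SIZE is assumed to equal the block size.

__global__ void convolution_1D_tiled_caching_kernel(float *N, float *P,
                                                     int Mask_Width, int Width) {
   int i = blockIdx.x*blockDim.x + threadIdx.x;
   __shared__ float N_ds[TILE_SIZE];
   // Load only the internal elements of the tile into shared memory.
   N_ds[threadIdx.x] = N[i];
   __syncthreads();
   int This_tile_start_point = blockIdx.x * blockDim.x;
   int Next_tile_start_point = (blockIdx.x + 1) * blockDim.x;
   int N_start_point = i - (Mask_Width/2);
   float Pvalue = 0;
   for (int j = 0; j < Mask_Width; j++) {
      int N_index = N_start_point + j;
      if (N_index >= 0 && N_index < Width) {        // skip ghost elements
         if ((N_index >= This_tile_start_point)
             && (N_index < Next_tile_start_point)) {
            Pvalue += N_ds[threadIdx.x+j-(Mask_Width/2)]*M[j];  // internal element
         } else {
            Pvalue += N[N_index]*M[j];               // halo element, hopefully in L2
         }
      }
   }
   P[i] = Pvalue;
}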
Although we have shown kernel examples for only a 1D convolution, the techniques are directly applicable to 2D and 3D convolutions. In general, the index calculation for the N and M arrays is more complex for 2D and 3D convolutions due to the higher dimensionality. Also, one will have more levels of loop nesting for each thread since multiple dimensions need to be traversed when loading tiles and/or calculating output values. We encourage readers to complete these higher-dimension kernels as homework exercises.
In this chapter, we have studied convolution as an important parallel computation pattern. While convolution is used in many applications such as computer vision and video processing, it also represents a general pattern that forms the basis of many other parallel algorithms. For example, one can view the stencil algorithms in partial differential equation (PDE) solvers as a special case of convolution. For another example, one can also view the calculation of grid point force or the potential value as a special case of convolution.
We have presented a basic parallel convolution algorithm whose implementations will be limited by the DRAM bandwidth for accessing both the input N and mask M elements. We then introduced the constant memory and a simple modification to the kernel and host code to take advantage of constant caching and eliminate practically all DRAM accesses for the mask elements. We further introduced a tiled parallel convolution algorithm that reduces DRAM bandwidth consumption at the cost of more control flow divergence and programming complexity. Finally, we presented a simpler tiled parallel convolution algorithm that takes advantage of the L2 caches.
8.1. Calculate the P[0] value in Figure 8.3.
8.2. Consider performing a 1D convolution on array N={4,1,3,2,3} with mask M={2,1,4}. What is the resulting output array?
8.3. What do you think the following 1D convolution masks are doing?
8.4. Consider performing a 1D convolution on an array of size n with a mask of size m:
a. How many halo cells are there in total?
b. How many multiplications are performed if halo cells are treated as multiplications (by 0)?
c. How many multiplications are performed if halo cells are not treated as multiplications?
8.5. Consider performing a 2D convolution on a square matrix of size n×n with a square mask of size m×m:
a. How many halo cells are there in total?
b. How many multiplications are performed if halo cells are treated as multiplications (by 0)?
c. How many multiplications are performed if halo cells are not treated as multiplications?
8.6. Consider performing a 2D convolution on a rectangular matrix of size n1×n2 with a rectangular mask of size m1×m2:
a. How many halo cells are there in total?
b. How many multiplications are performed if halo cells are treated as multiplications (by 0)?
c. How many multiplications are performed if halo cells are not treated as multiplications?
8.7. Consider performing a 1D tiled convolution with the kernel shown in Figure 8.11 on an array of size n with a mask of size m using a tile of size t.
a. How many blocks are needed?
b. How many threads per block are needed?
c. How much shared memory is needed in total?
d. Repeat the same questions if you were using the kernel in Figure 8.13.
8.8. Revise the 1D kernel in Figure 8.6 to perform 2D convolution. Add more width parameters to the kernel declaration as needed.
8.9. Revise the tiled 1D kernel in Figure 8.8 to perform 2D convolution. Keep in mind that the host code also needs to be changed to declare a 2D M array in the constant memory. Pay special attention to the increased usage of shared memory. Also, the N_ds needs to be declared as a 2D shared memory array.
8.10. Revise the tiled 1D kernel in Figure 8.11 to perform 2D convolution. Keep in mind that the host code also needs to be changed to declare a 2D M array in the constant memory. Pay special attention to the increased usage of shared memory. Also, the N_ds needs to be declared as a 2D shared memory array.
9.1 Background
9.2 A Simple Parallel Scan
9.3 Work Efficiency Considerations
9.4 A Work-Efficient Parallel Scan
9.5 Parallel Scan for Arbitrary-Length Inputs
9.6 Summary
9.7 Exercises
Our next parallel pattern is prefix sum, which is also commonly known as scan. Parallel scan is frequently used to convert seemingly sequential operations, such as resource allocation, work assignment, and polynomial evaluation, into parallel operations. In general, if a computation is naturally described as a mathematical recursion, it can likely be parallelized as a parallel scan operation. Parallel scan plays a key role in massively parallel computing for a simple reason: any sequential section of an application can drastically limit the overall performance of the application. Many such sequential sections can be converted into parallel computation with parallel scan. Another reason why parallel scan is an important parallel pattern is that sequential scan algorithms are linear algorithms and are extremely work-efficient, which makes it also very important to control the work efficiency of parallel scan algorithms. As we will show, a slight increase in algorithm complexity can make parallel scan run slower than sequential scan for large data sets. Therefore, work-efficient parallel scan also represents an important class of parallel algorithms that can run effectively on parallel systems with a wide range of available computing resources.
Mathematically, an inclusive scan operation takes a binary associative operator ⊕, and an input array of n elements [x0, x1, …, xn−1], and returns the output array
[x0, (x0 ⊕ x1), …, (x0 ⊕ x1 ⊕ … ⊕ xn−1)]
For example, if ⊕ is addition, then an inclusive scan operation on the input array [3 1 7 0 4 1 6 3] would return [3 4 11 11 15 16 22 25].
We can illustrate the applications for inclusive scan operations using an example of cutting sausage for a group of people. Assume that we have a 40-inch sausage to be served to eight people. Each person has ordered a different amount in terms of inches: 3, 1, 7, 0, 4, 1, 6, 3. That is, person number 0 wants 3 inches of sausage, person number 1 wants 1 inch, and so on. We can cut the sausage either sequentially or in parallel. The sequential way is very straightforward. We first cut a 3-inch section for person number 0. The sausage is now 37 inches long. We then cut a 1-inch section for person number 1. The sausage becomes 36 inches long. We can continue to cut more sections until we serve the 3-inch section to person number 7. At that point, we have served a total of 25 inches of sausage, with 15 inches remaining.
With an inclusive scan operation, we can calculate all the cutting points based on the amount each person orders. That is, given an addition operation and an order input array [3 1 7 0 4 1 6 3], the inclusive scan operation returns [3 4 11 11 15 16 22 25]. The numbers in the returned array are the cutting locations. With this information, one can simultaneously make all eight cuts that will generate the sections each person ordered. The first cut point is at the 3-inch point, so the first section will be 3 inches long, as ordered by person number 0. The second cut point is at the 4-inch point, so the second section will be 1 inch long, as ordered by person number 1. The final cut will be at the 25-inch point, which will produce a 3-inch long section since the previous cut point is at the 22-inch point. This gives person number 7 what she ordered. Note that since all the cutting points are known from the scan operation, all cuts can be done in parallel.
In summary, an intuitive way of thinking about inclusive scan is that the operation takes an order from a group of people and identifies all the cutting points that allow the orders to be served all at once. The order could be for sausage, bread, campground space, or a contiguous chunk of memory in a computer. As long as we can quickly calculate all the cutting points, all orders can be served in parallel.
An exclusive scan operation is similar to an inclusive operation with the exception that it returns the output array
[0, x0, (x0 ⊕ x1), …, (x0 ⊕ x1 ⊕ … ⊕ xn−2)]
That is, the first output element is 0 while the last output element only reflects the contribution of up to xn−2.
The applications of an exclusive scan operation are pretty much the same as those for an inclusive scan. The exclusive scan simply provides slightly different information. In the sausage example, the exclusive scan would return [0 3 4 11 11 15 16 22], which are the beginning points of the cut sections. For example, the section for person number 0 starts at the 0-inch point. For another example, the section for person number 7 starts at the 22-inch point. The beginning point information is important in applications such as memory allocation, where the allocated memory is returned to the requester via a pointer to its beginning point.
Note that it is fairly easy to convert between the inclusive scan output and the exclusive scan output. One simply needs to shift the elements and fill in one element. When converting from inclusive to exclusive, one shifts all elements to the right and fills in the value 0 for element 0. When converting from exclusive to inclusive, one shifts all elements to the left and fills in the last element with the previous last element plus the last input element. Whether we care about the cutting points or the beginning points of the sections, it is just a matter of convenience to generate the inclusive or the exclusive scan directly.
In practice, parallel scan is often used as a primitive operation in parallel algorithms that perform radix sort, quick sort, string comparison, polynomial evaluation, solving recurrences, tree operations, and histograms.
Before we present parallel scan algorithms and their implementations, we would like to first show a work-efficient sequential inclusive scan algorithm and its implementation. We will assume that the operation is addition. The algorithm assumes that the input elements are in the x array and the output elements are to be written into the y array.
void sequential_scan(float *x, float *y, int Max_i) {
   y[0] = x[0];
   for (int i = 1; i < Max_i; i++) {
      y[i] = y[i-1] + x[i];
   }
}
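As a quick check (a usage sketch added here, not part of the original text), calling sequential_scan on the earlier eight-element example reproduces the expected inclusive scan result:

float x[8] = {3, 1, 7, 0, 4, 1, 6, 3};
float y[8];
sequential_scan(x, y, 8);
// y now holds {3, 4, 11, 11, 15, 16, 22, 25}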
The algorithm is work-efficient. With a reasonably good compiler, only one addition, one memory load, and one memory store are used in processing each input x element. This is pretty much the minimum we will ever be able to do. As we will see, when the sequential algorithm of a computation is so “lean and mean,” it is extremely challenging to develop a parallel algorithm that will consistently beat the sequential algorithm when the data set size becomes large.
We start with a simple parallel inclusive scan algorithm by doing a reduction operation for all output elements. The main idea is to create each element quickly by calculating a reduction tree of the relevant input elements for each output element. There are multiple ways to design the reduction tree for each output element. We will present a simple one that is shown in Figure 9.1.
Figure 9.1 A simple but work-inefficient parallel inclusive scan.
The algorithm is an in-place scan algorithm that operates on an array XY that originally contains the input elements. It then iteratively evolves the contents of the array into the output elements. Before the algorithm begins, we assume XY[i] contains the input element xi. At the end of iteration n, XY[i] will contain the sum of up to 2^n input elements at and before that location. That is, at the end of iteration 1, XY[i] will contain xi−1+xi; at the end of iteration 2, XY[i] will contain xi−3+xi−2+xi−1+xi; and so on.
Figure 9.1 illustrates the algorithm with a 16-element input example. Each vertical line represents an element of the XY array, with XY[0] in the leftmost position. The vertical direction shows the progress of iterations, starting from the top of the figure. For the inclusive scan, by definition, y0 is x0 so XY[0] contains its final answer. In the first iteration, each position other than XY[0] receives the sum of its current content and that of its left neighbor. This is illustrated by the first row of addition operators in Figure 9.1. As a result, XY[i] contains xi−1+xi. This is reflected in the labeling boxes under the first row of addition operators in Figure 9.1. For example, after the first iteration, XY[3] contains x2+x3, shown as ∑x2..x3. Note that after the first iteration, XY[1] is equal to x0+x1, which is the final answer for this position. So, there should be no further changes to XY[1] in subsequent iterations.
In the second iteration, each position other than XY[0] and XY[1] receives the sum of its current content and that of the position that is two elements away. This is illustrated in the labeling boxes below the second row of addition operators. As a result, XY[i] now contains xi−3+xi−2+xi−1+xi. For example, after the second iteration, XY[3] contains x0+x1+x2+x3, shown as ∑x0..x3. Note that after the second iteration, XY[2] and XY[3] contain their final answers and will not need to be changed in subsequent iterations.
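For concreteness, here is a worked trace (added for illustration, not part of the original text) of this algorithm on the earlier eight-element example, showing the contents of XY after each iteration:

XY initially:     3  1  7  0  4  1  6  3
after stride 1:   3  4  8  7  4  5  7  9
after stride 2:   3  4 11 11 12 12 11 14
after stride 4:   3  4 11 11 15 16 22 25

The last row matches the inclusive scan result given earlier.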
Readers are encouraged to work through the rest of the iterations. We now work on the implementation of the algorithm illustrated in Figure 9.1. We assign each thread to evolve the contents of one XY element. We will write a kernel that performs a scan on a section of the input that is small enough for a block to handle. The size of a section is defined as a compile-time constant SECTION_SIZE. We assume that the kernel launch will use SECTION_SIZE as the block size so there will be an equal number of threads and section elements. All results will be calculated as if the array only has the elements in the section. Later, we will make final adjustments to these sectional scan results for large input arrays. We also assume that the input values were originally in a global memory array X, the address of which is passed into the kernel as an argument. We will have all the threads in the block collaboratively load the X array elements into a shared memory array XY. At the end of the kernel, each thread will write its result into the assigned output array Y.
__global__ void work_inefficient_scan_kernel(float *X, float *Y,
                                              int InputSize) {
   __shared__ float XY[SECTION_SIZE];
   int i = blockIdx.x*blockDim.x + threadIdx.x;
   if (i < InputSize) {
      XY[threadIdx.x] = X[i];
   }
   // the code below performs iterative scan on XY
   …
   Y[i] = XY[threadIdx.x];
}
We now focus on the implementation of the iterative calculations for each XY element in Figure 9.1 as a for loop:
for (unsigned int stride = 1; stride <= threadIdx.x; stride *= 2) {
   __syncthreads();
   XY[threadIdx.x] += XY[threadIdx.x - stride];
}
The loop iterates through the reduction tree for the XY array position that is assigned to a thread. Note that we use a barrier synchronization to make sure that all threads have finished their current iteration of additions in the reduction tree before any of them starts the next iteration. This is the same use of __syncthreads() as in the reduction discussion in Chapter 6. When the stride value becomes greater than a thread’s threadIdx.x value, it means that the thread’s assigned XY position has already accumulated all the required input values. Thus, the thread can exit the loop. The smaller the threadIdx.x value, the earlier the thread will exit the loop. This is consistent with the example shown in Figure 9.1. The actions on the smaller positions of XY end earlier than those on the larger positions. This will cause some level of control divergence in the first warp when stride values are small. The effect should be quite modest for large block sizes, since it only impacts the first warp during the iterations with smaller stride values. The detailed analysis is left as an exercise. The final kernel is shown in Figure 9.2.
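For reference, here is a self-contained sketch that assembles the fragments above into one kernel. It is an illustrative reconstruction rather than the exact Figure 9.2 listing: the loop bound is made uniform across the block and the update is split around a second barrier so that every thread reaches each __syncthreads() call.

__global__ void work_inefficient_scan_kernel(float *X, float *Y, int InputSize) {
   __shared__ float XY[SECTION_SIZE];
   int i = blockIdx.x*blockDim.x + threadIdx.x;
   XY[threadIdx.x] = (i < InputSize) ? X[i] : 0.0f;
   for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
      __syncthreads();
      float temp = XY[threadIdx.x];
      if (threadIdx.x >= stride) {
         temp += XY[threadIdx.x - stride];   // add the element stride positions away
      }
      __syncthreads();
      XY[threadIdx.x] = temp;                // write back after all reads complete
   }
   if (i < InputSize) {
      Y[i] = XY[threadIdx.x];
   }
}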
We can easily convert an inclusive scan kernel to an exclusive scan kernel. Recall that an exclusive scan is equivalent to an inclusive scan with all elements shifted to the right by one position and element 0 filled with value 0. This is illustrated in Figure 9.3. Note that the only real difference is the alignment of elements on top of the picture. All labeling boxes are updated to reflect the new alignment. All iterative operations remain the same.
Figure 9.3 Work-inefficient parallel exclusive scan.
We can now easily convert the kernel in Figure 9.2 into an exclusive scan kernel. The only modification we need to make is to load 0 into XY[0] and X[i-1] into XY[threadIdx.x] for the other threads, as shown in the following code:
if (i < InputSize && threadIdx.x != 0) {
   XY[threadIdx.x] = X[i-1];
} else {
   XY[threadIdx.x] = 0;
}
Note that the XY positions whose associated input elements are outside the range are now also filled with 0. This causes no harm and yet it simplifies the code slightly. We leave the work to finish the exclusive scan kernel as an exercise.
We now analyze the work efficiency of the kernel in Figure 9.2. All threads will iterate up to log2(N) steps, where N is the SECTION_SIZE. In each iteration, the number of threads that do not need to do any addition is equal to the stride size. Therefore, the amount of work done by the algorithm is

(N − 1) + (N − 2) + (N − 4) + … + (N − N/2)

where the terms correspond to the stride values 1, 2, 4, …, N/2.
The first part of each term is independent of the stride, so these parts add up to N×log2(N). The second parts form a familiar geometric series that sums up to (N − 1). So the total number of add operations is

N×log2(N) − (N − 1)
Recall that the number of add operations for a sequential scan algorithm is N − 1. We can put this into perspective by comparing the number of add operations for different N values, as shown in Figure 9.4. Note that even for modest-size sections, the kernel in Figure 9.2 does much more work than the sequential algorithm. In the case of 1,024 elements, the kernel does nine times more work than the sequential code. The ratio will continue to grow as N becomes larger. Such additional work is problematic in two ways. First, the use of hardware for executing the parallel kernel is much less efficient. In fact, just to break even, one needs at least nine times more execution units in the parallel machine than in the sequential machine. For example, if we execute the kernel on a parallel machine with four times the execution resources of a sequential machine, the parallel machine executing the parallel kernel can end up with only half the performance of the sequential machine executing the sequential code. Second, all the extra work consumes additional energy. This makes the kernel inappropriate for power-constrained environments such as mobile applications.
While the kernel in Figure 9.2 is conceptually simple, its work efficiency is too low for many practical applications. Just by inspecting Figures 9.1 and 9.3, we can see that there are potential opportunities for sharing some intermediate results to streamline the operations performed. However, to allow more sharing across multiple threads, we need to quickly calculate the intermediate results to be shared and then quickly distribute them to different threads.
As we know, the fastest parallel way to produce the sum of a set of values is a reduction tree. A reduction tree can generate the sum of N values in log2(N) steps. Furthermore, the tree also generates a number of subsums that can be used in the calculation of some of the scan output values.
In Figure 9.5, we produce the sum of all 16 elements in four steps. We use the minimal number of operations needed to generate the sum. During the first step, only the odd-indexed elements of XY will be changed to xi−1+xi. During the second step, only the XY elements whose indices are of the form 4×n − 1, which are 3, 7, 11, and 15 in Figure 9.5, will be updated. During the third step, only the XY elements whose indices are of the form 8×n − 1, which are 7 and 15, will be updated. Finally, during the fourth step, only XY[15] is updated. The total number of operations performed is 8+4+2+1=15. In general, for a scan section of N elements, we would do (N/2)+(N/4)+…+2+1 = N − 1 operations in this reduction phase.
Figure 9.5 Basic idea of a work-efficient parallel scan algorithm.
The second part of the algorithm is to use a reverse tree to distribute the partial sums to the positions that can use these values as quickly as possible. This is illustrated in the bottom half of Figure 9.5. At the end of the reduction phase, we have quite a few usable partial sums. For our example, the first row of Figure 9.6 shows all the partial sums in XY right after the top reduction tree. An important observation is that XY[0], XY[7], and XY[15] contain their final answers. Therefore, all remaining XY elements can obtain the partial sums they need from no farther than four positions away. For example, XY[14] can obtain all the partial sums it needs from XY[7], XY[11], and XY[13]. To organize the second half of the addition operations, we will first show all the operations that need partial sums from four positions away, then from two positions away, then from one position away. By inspection, XY[7] contains a critical value needed by many positions in the right half. A good move is to add XY[7] to XY[11], which brings XY[11] to its final answer. More importantly, XY[7] also becomes a good partial sum for XY[12], XY[13], and XY[14]. No other partial sum has so many uses. Therefore, there is only one addition, XY[11]=XY[7]+XY[11], that needs to occur at the four-position level in Figure 9.5. We show the updated partial sum in the second row of Figure 9.6.
Figure 9.6 Partial sums available in each XY element after the reduction tree phase.
We now identify all additions for getting partial sums that are two positions away. We see that XY[2] only needs the partial sum that is next to it in XY[1]. This is the same with XY[4]—it needs the partial sum next to it to be complete. The first XY element that needs a partial sum from two positions away is XY[5]. Once we calculate XY[5]=XY[3]+XY[5], XY[5] contains the final answer. The same analysis shows that XY[6] and XY[8] can be completed with the partial sums next to them in XY[5] and XY[7].
The next two-position addition is XY[9]=XY[7]+XY[9], which makes XY[9] complete. XY[10] can wait for the next round to pick up the value from XY[9]. XY[12] only needs XY[11], which contains its final answer after the four-position addition. The final two-position addition is XY[13]=XY[11]+XY[13]. The third row shows all the updated partial sums in XY[5], XY[9], and XY[13]. It is clear that now every position is either complete or can be completed when added to by its left neighbor. This leads to the final row of additions in Figure 9.5, which completes the contents of all the incomplete positions XY[2], XY[4], XY[6], XY[8], XY[10], XY[12], and XY[14].
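As a smaller worked illustration (added here, not part of the original text), the two phases evolve an eight-element array XY = [3 1 7 0 4 1 6 3] as follows:

Reduction phase:
   after stride 1:   3  4  7  7  4  5  6  9
   after stride 2:   3  4  7 11  4  5  6 14
   after stride 4:   3  4  7 11  4  5  6 25
Distribution phase:
   after stride 2:   3  4  7 11  4 16  6 25
   after stride 1:   3  4 11 11 15 16 22 25

The final row is the inclusive scan of the input, as expected.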
We could implement the reduction tree phase of the parallel scan using the following loop:
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
   __syncthreads();
   if ((threadIdx.x + 1) % (2*stride) == 0) {
      XY[threadIdx.x] += XY[threadIdx.x - stride];
   }
}
Note that this loop is very similar to the reduction in Figure 6.2. The only difference is that, in each iteration, we want the threads whose indices are of the form k×2×stride − 1, rather than k×2×stride, to perform the addition. This is why we add 1 to threadIdx.x when selecting the threads that perform the addition in each iteration. However, this style of reduction is known to have control divergence problems. A better way is to use a decreasing number of contiguous threads to perform the additions as the loop advances:
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
   __syncthreads();
   int index = (threadIdx.x+1) * 2 * stride - 1;
   if (index < blockDim.x) {
      XY[index] += XY[index - stride];
   }
}
In our example in Figure 9.5, there are 16 threads in the block. In the first iteration, the stride is equal to 1. The first 8 consecutive threads in the block will satisfy the if condition. The index values calculated for these threads will be 1, 3, 5, 7, 9, 11, 13, and 15. These threads will perform the first row of additions in Figure 9.5. In the second iteration, the stride is equal to 2. Only the first 4 threads in the block will satisfy the if condition. The index values calculated for these threads will be 3, 7, 11, and 15. These threads will perform the second row of additions in Figure 9.5. Note that since we will always be using consecutive threads in each iteration, the control divergence problem does not arise until the number of active threads drops below the warp size.
The distribution tree is a little more complex to implement. We observe that the stride value decreases from SECTION_SIZE/4 to 1. In each iteration, we need to “push” the value of an XY element from a position that is a multiple of twice the stride value, minus 1, to the position that is a stride away. For example, in the first distribution iteration in Figure 9.5 the stride is 4, and we would like to push the value of XY[7] to XY[11], where 7 is 2×4 − 1 (i.e., 8 − 1). In the second iteration the stride is 2, and we would like to push the values of XY[3], XY[7], and XY[11] to XY[5], XY[9], and XY[13]. This can be implemented with the following loop:
for (int stride = SECTION_SIZE/4; stride > 0; stride /= 2) {
   __syncthreads();
   int index = (threadIdx.x+1) * stride * 2 - 1;
   if (index + stride < BLOCK_SIZE) {
      XY[index + stride] += XY[index];
   }
}
The calculation of index is similar to that in the reduction tree phase. The final kernel for a work-efficient parallel scan is shown in Figure 9.7. Readers should notice that we never need more than SECTION_SIZE/2 threads for either the reduction phase or the distribution phase. So, we could simply launch a kernel with SECTION_SIZE/2 threads in a block. Since we can have up to 1,024 threads in a block, each scan section can have up to 2,048 elements. However, we would then need each thread to load two X elements at the beginning and store two Y elements at the end. This is left as an exercise.
Figure 9.7 A work-efficient kernel for an inclusive scan.
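Since the full listing of Figure 9.7 is not reproduced here, the following sketch assembles the two loops above into one kernel. It is an illustrative reconstruction, not the book's exact code; it assumes the kernel is launched with SECTION_SIZE threads per block and treats BLOCK_SIZE and SECTION_SIZE as the same compile-time constant.

__global__ void work_efficient_scan_kernel(float *X, float *Y, int InputSize) {
   __shared__ float XY[SECTION_SIZE];
   int i = blockIdx.x*blockDim.x + threadIdx.x;
   XY[threadIdx.x] = (i < InputSize) ? X[i] : 0.0f;
   // reduction tree phase
   for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
      __syncthreads();
      int index = (threadIdx.x+1) * 2 * stride - 1;
      if (index < blockDim.x) {
         XY[index] += XY[index - stride];
      }
   }
   // distribution tree phase
   for (int stride = SECTION_SIZE/4; stride > 0; stride /= 2) {
      __syncthreads();
      int index = (threadIdx.x+1) * stride * 2 - 1;
      if (index + stride < SECTION_SIZE) {
         XY[index + stride] += XY[index];
      }
   }
   __syncthreads();
   if (i < InputSize) {
      Y[i] = XY[threadIdx.x];
   }
}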
As was the case of the work-inefficient scan kernel, one can easily adapt the work-efficient inclusive parallel scan kernel into an exclusive scan kernel with a minor adjustment to the statement that loads X elements into XY. Interested readers should also read [Harris 2007] for an interesting natively exclusive scan kernel that is based on a different way of designing the distribution tree phase of the scan kernel.
We now analyze the number of operations in the distribution tree stage. The number of operations is (16/8 − 1) + (16/4) + (16/2). In general, for N input elements, the total number of operations would be (N/2) + (N/4) + … + 4 + 2 − 1, which is less than N − 2. This makes the total number of operations in the parallel scan 2×N − 3. Note that the number of operations is now proportional to N, rather than to N×log2(N). We compare the number of operations performed by the two algorithms for N from 16 to 1,024 in Figure 9.8.
Figure 9.8 Work efficiency of the kernels.
The advantage of the work-efficient algorithm is quite clear in this comparison. As the input section becomes bigger, the work-efficient algorithm never performs more than twice the number of operations performed by the sequential algorithm. As long as we have at least two times more hardware execution resources, the parallel algorithm will achieve better performance than the sequential algorithm. This is not true, however, for the work-inefficient algorithm. For 1,024 elements, the work-inefficient parallel algorithm needs at least nine times the hardware execution resources just to break even.
For many applications, the number of elements to be processed by a scan operation can be in the millions. Obviously, we cannot expect that all input elements can fit into the shared memory. Furthermore, it would be a loss of parallelism opportunity if we used only one thread block to process these large data sets. Fortunately, there is a hierarchical approach to extending the scan kernels that we have generated so far to handle inputs of arbitrary size. The approach is illustrated in Figure 9.9.
Figure 9.9 A hierarchical scan for arbitrary-length inputs.
For a large data set, we first partition the input into sections that can fit into the shared memory and be processed by a single block. For the current generation of CUDA devices, the work-efficient kernel in Figure 9.7 can process up to 2,048 elements in each section using 1,024 threads in each block. For example, if the input data consists of 2,000,000 elements, we can use ceil(2,000,000/2,048.0)=977 thread blocks. With up to 65,536 thread blocks in the x dimension of a grid, the approach can process up to 134,217,728 elements in the input set. If the input is even bigger than this, we can use additional levels of hierarchy to handle a truly arbitrary number of input elements. However, for this chapter, we will restrict our discussion to a two-level hierarchy that can process up to 134,217,728 elements.
Assume that the host code launches the kernel in Figure 9.7 on the input. Note that the kernel uses the familiar i = blockIdx.x*blockDim.x + threadIdx.x statement to direct the threads in each block to load their input values from the appropriate section. At the end of the grid execution, the threads write their results into the Y array. That is, after the kernel in Figure 9.7 completes, the Y array contains the scan results for the individual sections, called scan blocks in Figure 9.9. Each result in a scan block only contains the accumulated values of all preceding elements in the same scan block. These scan blocks need to be combined into the final result. That is, we need to write and launch another kernel that adds the sum of all elements in the preceding scan blocks to each element of a scan block.
Figure 9.10 shows a small operational example of the hierarchical scan approach of Figure 9.9. In this example, there are 16 input elements that are divided into four scan blocks. The kernel treats the four scan blocks as independent input data sets. After the scan kernel terminates, each Y element contains the scan result within its scan block. For example, scan block 1 has inputs 0, 4, 1, 2. The scan kernel produces the scan result for this section (0, 4, 5, 7). Note that these results do not contain the contributions from any of the elements in scan block 0. To produce the final result for this scan block, the sum of all elements in scan block 0 (2+1+3+1=7) should be added to every result element of scan block 1.
Figure 9.10 An example of a hierarchical scan.
For another example, the inputs in scan block 2 are 0, 3, 1, and 2. The kernel produces the scan result for this scan block (0, 3, 4, 6). To produce the final results for this scan block, the sum of all elements in both scan blocks 0 and 1 (2+1+3+1+0+4+1+2=14) should be added to every result element of scan block 2.
It is important to note that the last scan output element of each scan block gives the sum of all input elements of the scan block. These values are 7, 7, 6, and 11 in Figure 9.10. This brings us to the second step of the hierarchical scan algorithm in Figure 9.9, which gathers the last result elements from each scan block into an array and performs a scan on these output elements. This step is also illustrated in Figure 9.10, where the last scan output elements are all collected into a new array S. This can be done by changing the code at the end of the scan kernel so that the last thread of each block writes its result into an S array using its blockIdx.x as the index. A scan operation is then performed on S to produce the output values 7, 14, 20, and 31. Note that each of these second-level scan output values is the accumulated sum from the beginning location X[0] to the end of each scan block. That is, the output value in S[0]=7 is the accumulated sum from X[0] to the end of scan block 0, which is X[3]. The output value in S[1]=14 is the accumulated sum from X[0] to the end of scan block 1, which is X[7].
Therefore, the output values in the S array give the scan results at “strategic” locations of the original scan problem. That is, in Figure 9.10, the output values in S[0], S[1], S[2], and S[3] give the final scan results for the original problem at positions X[3], X[7], X[11], and X[15]. These results can be used to bring the partial results in each scan block to their final values. This brings us to the last step of the hierarchical scan algorithm in Figure 9.9. The second-level scan output values are added to the values of their corresponding scan blocks.
For example, in Figure 9.10, the value of S[0] (7) will be added to Y[4], Y[5], Y[6], and Y[7] of thread block 1, which completes the results in these positions. The final results in these positions are 7, 11, 12, and 14. This is because S[0] contains the sum of the values of the original input X[0] through X[3]. The value of S[1] (14) will be added to Y[8], Y[9], Y[10], and Y[11], which completes the results in these positions; the final results in these positions are 14, 17, 18, and 20. The value of S[2] (20) will be added to Y[12], Y[13], Y[14], and Y[15]. Finally, the value of S[3] is the sum of all elements of the original input, which is also the final result in Y[15].
Readers who are familiar with computer arithmetic algorithms should recognize that the hierarchical scan algorithm is quite similar to the carry look-ahead in hardware adders of modern processors.
We can implement the hierarchical scan with three kernels. The first kernel is largely the same as the kernel in Figure 9.7. We need to add one more parameter S, which has the dimension of InputSize/SECTION_SIZE. At the end of the kernel, we add a conditional statement for the last thread in the block to write the output value of the last XY element in the scan block to the blockIdx.x position of S:
__syncthreads();
if (threadIdx.x == blockDim.x - 1) {
   S[blockIdx.x] = XY[SECTION_SIZE - 1];
}
The second kernel is simply the same as the kernel in Figure 9.7; it takes S as its input and writes S as its output.
The third kernel takes the S and Y arrays as inputs and writes the output back into Y. The body of the kernel adds one of the S elements to all Y elements:
int i = blockIdx.x * blockDim.x + threadIdx.x;
Y[i] += S[blockIdx.x];
We leave it as an exercise for readers to complete the details of each kernel and complete the host code.
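As a starting point for that exercise, here is a minimal host-side sketch of the three-step launch sequence. The kernel names, the h_X/h_Y host arrays, and the convention that each scan block is adjusted by the sum of all preceding scan blocks are assumptions of this sketch, not the book's exact code; error checking is omitted, and numBlocks is assumed to be no larger than SECTION_SIZE so that a single block can scan S.

int numBlocks = (InputSize + SECTION_SIZE - 1) / SECTION_SIZE;
float *d_X, *d_Y, *d_S;
cudaMalloc((void**)&d_X, InputSize*sizeof(float));
cudaMalloc((void**)&d_Y, InputSize*sizeof(float));
cudaMalloc((void**)&d_S, numBlocks*sizeof(float));
cudaMemcpy(d_X, h_X, InputSize*sizeof(float), cudaMemcpyHostToDevice);
// Step 1: scan each section; each block also writes its section total into S.
sectional_scan_kernel<<<numBlocks, SECTION_SIZE>>>(d_X, d_Y, d_S, InputSize);
// Step 2: scan the block sums themselves with the plain kernel of Figure 9.7.
scan_kernel<<<1, SECTION_SIZE>>>(d_S, d_S, numBlocks);
// Step 3: add to every element of scan block b the sum of all preceding
// scan blocks (S[b-1] after the inclusive scan of S; block 0 needs no update).
add_scanned_sums_kernel<<<numBlocks, SECTION_SIZE>>>(d_Y, d_S, InputSize);
cudaMemcpy(h_Y, d_Y, InputSize*sizeof(float), cudaMemcpyDeviceToHost);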
In this chapter, we studied scan as an important parallel computation pattern. Scan is used to enable parallel allocation of resources to parties whose needs are not uniform. It converts seemingly sequential recursive computation into parallel computation, which helps to reduce sequential bottlenecks in many applications. We showed that a simple sequential scan algorithm performs only N − 1 additions for an input of N elements.
We first introduced a parallel scan algorithm that is conceptually simple but not work-efficient. As the data set size increases, the number of execution units needed for the parallel algorithm to break even with the simple sequential algorithm also increases. For an input of 1,024 elements, the parallel algorithm performs over nine times more additions than the sequential algorithm and requires at least nine times more execution units to break even with the sequential algorithm. This makes the work-inefficient parallel algorithm inappropriate for power-limited environments such as mobile applications.
We then presented a work-efficient parallel scan algorithm that is conceptually more complicated. Using a reduction tree phase and a distribution tree phase, the algorithm performs only 2×N − 3 additions no matter how large the input data set is. Such work-efficient algorithms, whose number of operations grows linearly with the size of the input set, are often also referred to as data-scalable algorithms. We also presented a hierarchical approach to extending the work-efficient parallel scan algorithm to handle input sets of arbitrary size.
9.1. Analyze the parallel scan kernel in Figure 9.2. Show that control divergence only occurs in the first warp of each block for stride values up to half of the warp size. That is, for warp size 32, control divergence will occur in the iterations for stride values 1, 2, 4, 8, and 16.
9.2. For the work-efficient scan kernel, assume that we have 2,048 elements. How many add operations will be performed in both the reduction tree phase and the inverse reduction tree phase?
9.3. For the work-inefficient scan kernel based on reduction trees, assume that we have 2,048 elements. Which of the following gives the closest approximation on how many add operations will be performed?
9.4. Use the algorithm in Figure 9.3 to complete an exclusive scan kernel.
9.5. Complete the host code and all the three kernels for the hierarchical parallel scan algorithm in Figure 9.9.
9.6. Analyze the hierarchical parallel scan algorithm and show that it is work-efficient and the total number of additions is no more than 4×N − 3.
9.7. Consider the following array: [4 6 7 1 2 8 5 2]. Perform a parallel inclusive prefix scan on the array using the work-inefficient algorithm. Report the intermediate states of the array after each step.
9.8. Repeat Exercise 9.7 using the work-efficient algorithm.
9.9. Using the two-level hierarchical scan discussed in Section 9.5, what is the largest possible data set that can be handled if computing on a:
1. Harris, M. Parallel Prefix Sum with CUDA, 2007. Available at: <http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/scan/doc/scan.pdf>.
10.1 Background
10.2 Parallel SpMV Using CSR
10.3 Padding and Transposition
10.4 Using Hybrid to Control Padding
10.5 Sorting and Partitioning for Regularization
10.6 Summary
10.7 Exercises
Our next parallel pattern is sparse matrix computation. In a sparse matrix, the vast majority of the elements are zeros. Storing and processing these zero elements are wasteful in terms of memory, time, and energy. Many important real-world problems involve sparse matrix computations that are highly parallel in nature. Due to the importance of these problems, several sparse matrix storage formats and their corresponding processing methods have been proposed and widely used in the field. All of them employ some type of compaction techniques to avoid storing or processing zero elements at the cost of introducing some level of irregularity into the data representation. Unfortunately, such irregularity can lead to underutilization of memory bandwidth, control flow divergence, and load imbalance in parallel computing. It is therefore important to strike a good balance between compaction and regularization. Some storage formats achieve a higher level of compaction at a high level of irregularity. Others achieve a more modest level of compaction while keeping the representation more regular. The parallel computation performance of their corresponding methods is known to be heavily dependent on the distribution of nonzero elements in the sparse matrices. Understanding the wealth of work in sparse matrix storage formats and their corresponding parallel algorithms gives a parallel programmer an important background for addressing compaction and regularization challenges in solving related problems.
A sparse matrix is a matrix where the majority of the elements are zero. Sparse matrices arise in many science, engineering, and financial modeling problems. For example, as we saw in Chapter 7, matrices are often used to represent the coefficients in a linear system of equations. Each row of the matrix represents one equation of the linear system. In many science and engineering problems, there are a large number of variables and the equations involved are loosely coupled. That is, each equation only involves a small number of variables. This is illustrated in Figure 10.1, where variables x0 and x2 are involved in equation 0, none of the variables in equation 1, variables x1, x2, and x3 in equation 2, and finally variables x0 and x3 in equation 3.
Figure 10.1 A simple sparse matrix example.
Sparse matrices are stored in a format that avoids storing zero elements. We will start with the compressed sparse row (CSR) storage format, which is illustrated in Figure 10.2. CSR stores only nonzero values in a 1D data storage, shown as data[] in Figure 10.2. Array data[] stores all the nonzero values in the sparse matrix in Figure 10.1. This is done by storing nonzero elements of row 0 (3 and 1) first, followed by nonzero elements of row 1 (none), followed by nonzero elements of row 2 (2, 4, 1), and then nonzero elements of row 3 (1, 1). The format compresses away all zero elements.
Figure 10.2 Example of CSR format.
With the compressed format, we need to put in two sets of markers to preserve the structure of the original sparse matrix. The first set of markers forms a column index array, col_index[] in Figure 10.2, which gives the column index of every nonzero value in the original sparse matrix. Since we have squeezed away the zero elements of each row, we need these markers to remember where the remaining elements were in the original rows of the sparse matrix. For example, values 3 and 1 came from columns 0 and 2 of row 0 in the original sparse matrix, so col_index[0] and col_index[1] are assigned to store the column indices of these two elements. Likewise, values 2, 4, and 1 came from columns 1, 2, and 3 of row 2, so col_index[2], col_index[3], and col_index[4] store indices 1, 2, and 3.
The second set of markers gives the starting location of every row in the compressed storage. This is needed because the size of each row becomes variable once the zero elements are removed; it is no longer possible to use indexing based on the row size to find the starting location of each row in the compressed storage. In Figure 10.2, we show a row_ptr[] array whose elements are the indices of the beginning locations of the rows. That is, row_ptr[0] indicates that row 0 starts at location 0 of the data[] array, row_ptr[1] indicates that row 1 starts at location 2, and so on. Note that row_ptr[1] and row_ptr[2] are both 2, which means that none of the elements of row 1 is stored in the compressed format. This makes sense, since row 1 in Figure 10.1 consists entirely of zero values. Note also that row_ptr[4] stores the starting location of a nonexistent row 4. This is for convenience, as some algorithms need the starting location of the next row to delineate the end of the current row; this extra marker gives a convenient way to locate the ending location of row 3.
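For concreteness, the CSR arrays for the small matrix in Figure 10.1 can be written out directly from the description above. This is a sketch in C; all values are taken from the text, and the final row_ptr entry of 7 is the extra end-of-matrix marker:

```c
// CSR representation of the 4x4 matrix in Figure 10.1 (7 nonzeros).
float data[7]      = {3, 1, 2, 4, 1, 1, 1};  // nonzeros, stored row by row
int   col_index[7] = {0, 2, 1, 2, 3, 0, 3};  // column of each nonzero
int   row_ptr[5]   = {0, 2, 2, 5, 7};        // start of each row, plus end marker
```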
As we discussed in Chapter 7, matrices are often used in solving a linear system of N equations of N variables in the form A×X+Y=0, where A is an N×N matrix, X is a vector of N variables, and Y is a vector of N constant values. The objective is to solve for the X values that satisfy all the equations. An intuitive approach is to invert the matrix so that X=A^(−1)×(−Y). This can be done through methods such as Gaussian elimination for moderate-size matrices. While it is theoretically possible to use this approach on the equations represented by a sparse matrix, the sheer size and number of zero elements of many sparse linear systems can simply overwhelm it.
Instead, sparse linear systems can often be better solved with an iterative approach. When the sparse matrix A is positive-definite (i.e., x^T A x > 0 for all nonzero vectors x in R^n), one can use conjugate gradient methods to iteratively solve the corresponding linear system with guaranteed convergence to a solution [Hest1952]. This is done by guessing a solution X, performing A×X+Y, and seeing whether the result is close to the 0 vector. If not, we use a gradient vector formula to refine X and perform another iteration of A×X+Y with the refined X. The most time-consuming part of such an iterative approach is the evaluation of A×X+Y, which is a sparse matrix–vector multiplication and accumulation. Figure 10.3 shows a small example of matrix–vector multiplication in which A is a sparse matrix; the dark squares in A represent nonzero elements. In contrast, X and Y are typically dense vectors; that is, most of their elements hold nonzero values. Due to its importance, standardized library function interfaces have been created to perform this operation under the name SpMV (sparse matrix–vector multiplication). We will use SpMV to illustrate the important trade-offs between different storage formats in sparse computation.
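To make concrete where SpMV sits in such a solver, the outline below is illustrative only: norm() and refine_guess() are hypothetical stand-ins for the convergence test and the gradient-based update, and spmv_csr() is the sequential loop discussed in the next section.

```c
// Illustrative outline of an iterative solver built around SpMV.
// norm() and refine_guess() are hypothetical placeholders; spmv_csr()
// is the sequential SpMV/CSR loop shown with Figure 10.4.
for (int iter = 0; iter < max_iter; iter++) {
  spmv_csr(num_rows, data, col_index, row_ptr, X, R);  // R = A*X
  for (int i = 0; i < num_rows; i++) R[i] += Y[i];     // R = A*X + Y
  if (norm(R, num_rows) < tolerance) break;   // close enough to the 0 vector?
  refine_guess(X, R, num_rows);               // gradient-based update of X
}
```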
Figure 10.3 A small example of matrix–vector multiplication and accumulation.
A sequential implementation of SpMV based on CSR is quite straightforward, as shown in Figure 10.4. We assume that the code has access to (1) num_rows, a function argument that specifies the number of rows in the sparse matrix, and (2) the floating-point data[] array and the integer col_index[] and row_ptr[] arrays of Figure 10.2, along with the floating-point vector x[]. There are only seven lines of code. Line 1 is a loop that iterates through all rows of the matrix, with each iteration calculating the dot product of the current row and the vector x.
Figure 10.4 A sequential loop that implements SpMV.
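The following is a sketch of the loop in Figure 10.4, reconstructed from the line-by-line description that follows; line numbers appear as comments because the text refers to them, and the function name and parameter order are assumptions:

```c
void spmv_csr(int num_rows, float *data, int *col_index,
              int *row_ptr, float *x, float *y) {
  for (int row = 0; row < num_rows; row++) {  // line 1: iterate over all rows
    float dot = 0;                            // line 2: initialize dot product
    int row_start = row_ptr[row];             // line 3: start of this row
    int row_end   = row_ptr[row + 1];         // line 4: start of the next row
    for (int elem = row_start; elem < row_end; elem++)  // line 5
      dot += data[elem] * x[col_index[elem]];           // line 6
    y[row] = dot;                             // line 7: store the result
  }
}
```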
In each row, line 2 first initializes the dot product to zero. It then sets up the range of data[] array elements that belong to the current row. The starting and ending locations can be loaded from the row_ptr[] array. Figure 10.5 illustrates this for the small sparse matrix in Figure 10.1. For row=0, row_ptr[row] is 0 and row_ptr[row+1] is 2. Note that the two elements from row 0 reside in data[0] and data[1]. That is, row_ptr[row] gives the starting position of the current row and row_ptr[row+1] gives the starting position of the next row, which is one after the ending position of the current row. This is reflected in the loop in line 5, where the loop index iterates from the position given by row_ptr[row] to row_ptr[row+1]−1.
Figure 10.5 SpMV loop operating on the sparse matrix in Figure 10.1.
The loop body in line 6 calculates the dot product for the current row. For each element, it uses the loop index elem to access the matrix element in data[elem]. It also uses elem to retrieve the element's column index from col_index[elem]. This column index is then used to access the appropriate x element for multiplication. For example, the elements in data[0] and data[1] are from column 0 (col_index[0]=0) and column 2 (col_index[1]=2), so the inner loop performs the dot product for row 0 as data[0]*x[0]+data[1]*x[2]. Readers are encouraged to work out the dot products for the other rows as an exercise.
CSR completely removes all zero elements from the storage. It does incur overhead by introducing the col_index and row_ptr arrays. In our small example, where the number of zero elements is not much larger than the number of nonzero elements, this overhead is slightly more than the space needed for the nonzero elements themselves.
It should be obvious that any SpMV code will reflect the storage format assumed. Therefore, we will add the storage format to the name of a code to clarify the combination used. We will refer to the SpMV code in Figure 10.4 as sequential SpMV/CSR. With a good understanding of sequential SpMV/CSR, we are now ready to discuss parallel sparse computation.
Note that the dot product calculation for each row of the sparse matrix is independent of those of other rows. This is reflected in the fact that all iterations of the outer loop (line 1) in Figure 10.4 are logically independent of each other. We can easily convert this sequential SpMV/CSR into a parallel CUDA kernel by assigning each iteration of the outer loop to a thread, which is illustrated in Figure 10.6, where thread 0 calculates the dot product for row 0, thread 1 for row 1, and so on.
Figure 10.6 Example of mapping threads to rows in parallel SpMV/CSR.
In a real sparse matrix computation, there are usually thousands to millions of rows, each of which contains tens to hundreds of nonzero elements. This makes the mapping shown in Figure 10.6 seem very appropriate: there are many threads, and each thread has a substantial amount of work. We show a parallel SpMV/CSR kernel in Figure 10.7.
Figure 10.7 A parallel SpMV/CSR kernel.
It should be clear that the kernel looks almost identical to the sequential SpMV/CSR loop. The loop construct has been removed, since it is replaced by the thread grid. All the other changes are very similar to the vector addition kernel in Chapter 3. In line 2, the row index is calculated with the familiar expression blockIdx.x * blockDim.x + threadIdx.x. Also, to handle an arbitrary number of rows, line 3 checks whether the row index of a thread exceeds the number of rows; this covers the situation where the number of rows is not a multiple of the thread block size.
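A sketch of the kernel in Figure 10.7, consistent with the description above (the kernel name and parameter list are assumptions):

```c
__global__ void spmv_csr_kernel(int num_rows, float *data, int *col_index,
                                int *row_ptr, float *x, float *y) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;  // line 2: one row per thread
  if (row < num_rows) {                             // line 3: guard extra threads
    float dot = 0;
    for (int elem = row_ptr[row]; elem < row_ptr[row + 1]; elem++)
      dot += data[elem] * x[col_index[elem]];
    y[row] = dot;
  }
}
```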
While the parallel SpMV/CSR kernel is quite simple, it has two major shortcomings. First, the kernel does not make coalesced memory accesses. If readers examine Figure 10.5, it should be obvious that adjacent threads make simultaneous accesses to nonadjacent memory locations. In our small example, threads 0, 1, 2, and 3 will access data[0], none, data[2], and data[5] in the first iteration of their dot product loops. They will then access data[1], none, data[3], and data[6] in the second iteration, and so on. These simultaneous accesses by adjacent threads are clearly not to adjacent locations, so the parallel SpMV/CSR kernel does not make efficient use of memory bandwidth.
The second shortcoming of the SpMV/CSR kernel is that it can potentially have significant control flow divergence in all warps. The number of iterations taken by a thread in the dot product loop depends on the number of nonzero elements in the row assigned to the thread. Since the distribution of nonzero elements among rows can be random, adjacent rows can have a very different number of nonzero elements. As a result, there can be widespread control flow divergence in most or even all warps.
It should be clear that both the execution efficiency and the memory bandwidth efficiency of the parallel SpMV kernel depend on the distribution of the input data matrix. This is quite different from most of the kernels we have presented so far. However, such data-dependent performance behavior is quite common in real-world applications. This is one of the reasons why parallel SpMV is such an important parallel pattern: it is simple, yet it illustrates an important behavior in many complex parallel applications. In the next sections, we will discuss important techniques that address the two shortcomings of the parallel SpMV/CSR kernel.
The problems of noncoalesced memory accesses and control divergence can be addressed with data padding and transposition of matrix layout. The ideas were used in the ELL storage format, the name of which came from the sparse matrix package ELLPACK. A simple way to understand the ELL format is to start with the CSR format, as illustrated in Figure 10.8.
Figure 10.8 ELL storage format.
From a CSR representation, we first determine the row with the maximal number of nonzero elements. We then add dummy (zero) elements after the nonzero elements of every other row to make all rows the same length as this maximal row, which turns the matrix into a rectangular matrix. For our small sparse matrix example, row 2 has the maximal number of elements, so we add one zero element to row 0, three zero elements to row 1, and one zero element to row 3 to make them all the same length. These additional zero elements appear as squares with an * in Figure 10.8. Note that the col_index array also needs to be padded in the same way to preserve its correspondence to the data values.
We can now lay out the padded matrix in column-major order. That is, we place all elements of column 0 in consecutive locations, followed by all elements of column 1, and so on. This is equivalent to transposing the rectangular matrix and laying it out in C's row-major order. In terms of our small example, after the transposition, data[0] through data[3] contain 3, *, 2, 1, the 0th elements of all rows. This is illustrated in the bottom portion of Figure 10.9. Similarly, col_index[0] through col_index[3] contain the column indices of the 0th elements of all rows. Note that we no longer need row_ptr, since the beginning of row i is now simply data[i]. With the padded elements, it is also very easy to move from the current element of row i to the next one by simply adding the number of rows to the index. For example, the 0th element of row 2 is in data[2] and the next element is in data[2+4]=data[6], where 4 is the number of rows in our small example.
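One possible host-side conversion from CSR to the transposed (column-major) ELL layout described above is sketched below; the function name and calling convention are assumptions for illustration. Padded slots keep value 0 (and column index 0) from calloc, so they contribute 0*x[0] to a dot product and are harmless:

```c
#include <stdlib.h>

// Sketch: convert a CSR matrix to column-major ELL with zero padding.
void csr_to_ell(int num_rows, const float *data, const int *col_index,
                const int *row_ptr, float **ell_data, int **ell_col,
                int *num_elem) {
  int max_len = 0;                        // length of the longest row
  for (int r = 0; r < num_rows; r++) {
    int len = row_ptr[r + 1] - row_ptr[r];
    if (len > max_len) max_len = len;
  }
  *num_elem = max_len;
  *ell_data = (float *)calloc((size_t)num_rows * max_len, sizeof(float));
  *ell_col  = (int *)calloc((size_t)num_rows * max_len, sizeof(int));
  for (int r = 0; r < num_rows; r++)
    for (int i = 0; i < row_ptr[r + 1] - row_ptr[r]; i++) {
      (*ell_data)[i * num_rows + r] = data[row_ptr[r] + i];  // element i of row r
      (*ell_col)[i * num_rows + r]  = col_index[row_ptr[r] + i];
    }
}
```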
Figure 10.9 More details of our small example in ELL.
Using the ELL format, we show a parallel SpMV/ELL kernel in Figure 10.10. The kernel receives slightly different arguments: it no longer needs row_ptr; instead, it needs an argument num_elem that gives the number of elements in each row after padding.
Figure 10.10 A parallel SpMV/ELL kernel.
A first observation is that the SpMV/ELL kernel code is simpler than SpMV/CSR. With padding, all rows are now the same length. In the dot product loop in line 5, all threads simply loop through the number of elements given by num_elem. As a result, there is no longer any control flow divergence in warps: all threads now iterate exactly the same number of times in the dot product loop. Whenever a dummy element is used in the multiplication, it does not affect the final result, because its value is zero.
A second observation is that in the dot product loop body, each thread accesses the 0th element of its row in data[row], and then its ith element in data[row+i*num_rows]. As we have seen in Figure 10.9, adjacent threads thus access adjacent memory locations, enabling memory coalescing and more efficient use of memory bandwidth.
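A sketch consistent with the kernel of Figure 10.10 as just described; the kernel name and parameter order are assumptions:

```c
__global__ void spmv_ell_kernel(int num_rows, float *data, int *col_index,
                                int num_elem, float *x, float *y) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < num_rows) {
    float dot = 0;
    for (int i = 0; i < num_elem; i++)  // same trip count for every thread
      dot += data[row + i * num_rows] * x[col_index[row + i * num_rows]];
    y[row] = dot;
  }
}
```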
By eliminating control flow divergence and enabling memory coalescing, SpMV/ELL should run faster than SpMV/CSR. Furthermore, SpMV/ELL is simpler. This seems to make SpMV/ELL an all-win approach, but it has a potential downside. In situations where one or a small number of rows have an exceedingly large number of nonzero elements, the ELL format results in an excessive number of padded elements. These padded elements take up storage, need to be fetched, and take part in calculations even though they do not contribute to the final result. With enough padded elements, an SpMV/ELL kernel can actually run more slowly than an SpMV/CSR kernel. This calls for a method to control the number of padded elements in an ELL representation.
The root of the problem with excessive padding in ELL is that one or a small number of rows have an exceedingly large number of nonzero elements. If we have a mechanism to “take away” some elements from these rows, we can reduce the number of padded elements in ELL. The coordinate (COO) format provides such a venue.
The COO format is illustrated in Figure 10.11, where each nonzero element is stored with both its column index and its row index; col_index and row_index arrays accompany the data array. For example, A[0,0] of our small example is now stored with both its column index (0 in col_index[0]) and its row index (0 in row_index[0]). With the COO format, one can look at any element in the storage and know where that nonzero element came from in the original sparse matrix. As with the ELL format, there is no need for row_ptr, since each element self-identifies its own column and row indices.
Figure 10.11 Example of COO format.
While the COO format does come with the cost of additional storage for the row_index array, it also comes with the additional benefit of flexibility. We can arbitrarily reorder the elements in a COO format without losing any information as long as we reorder the data, col_index, and row_index the same way. This is illustrated using our small example in Figure 10.12.
Figure 10.12 Reordering COO format.
In Figure 10.12, we have reordered the elements of data, col_index, and row_index. Now data[0] actually contains an element from row 2 and column 3 of the small sparse matrix. Because we have moved the row and column index values along with the data value, we can still correctly identify this element's position in the original sparse matrix. Readers may ask why we would want to reorder these elements: after all, such reordering disturbs the locality and sequential access patterns that are important for efficient use of memory bandwidth.
The answer lies in an important use case for the COO format: it can be used to curb the length of the longest rows in a CSR or ELL representation. First, we make an important observation: in the COO format, we can process the elements in any order we want. For each element data[i], we can simply perform the operation y[row_index[i]] += data[i] * x[col_index[i]]. As long as this operation is somehow performed for every element of data, the correct final answer will be calculated.
More importantly, we can take away some of the elements from the rows with exceedingly large numbers of nonzero elements and place them into a separate COO representation. We can then use either CSR or ELL to perform SpMV on the remaining elements. With the excess elements removed from the extra-long rows, the number of padded elements needed for the other rows can be significantly reduced. An SpMV/COO pass then finishes the job. This approach of employing two formats to collaboratively complete a computation is often referred to as a hybrid method.
Let’s illustrate a hybrid ELL and COO method for SpMV using our small sparse matrix, as shown in Figure 10.13. Row 2 has the largest number of nonzero elements, so we remove its last nonzero element from the ELL representation and move it into a separate COO representation. Doing so reduces the maximal number of nonzero elements among all rows from 3 to 2 and, as shown in Figure 10.13, reduces the number of padded elements from 5 to 2. More importantly, all threads now need only two iterations rather than three, which can give a 50% speedup to the parallel execution of the SpMV/ELL kernel.
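Spelled out in C for our small example, the hybrid representation described above would look like the sketch below; padded entries are shown with value 0 and column index 0, a convention assumed here:

```c
// ELL part after removing the last element of row 2
// (column-major, 4 rows, num_elem = 2: 0th elements, then 1st elements).
float ell_data[8] = {3, 0, 2, 1,   1, 0, 4, 1};
int   ell_col[8]  = {0, 0, 1, 0,   2, 0, 2, 3};
// COO part: the single element taken from row 2.
float coo_data[1] = {1};
int   coo_row[1]  = {2};
int   coo_col[1]  = {3};
```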
Figure 10.13 Our small example in ELL and COO hybrid.
A typical way of using an ELL-COO hybrid method is for the host to convert the data from a format such as CSR into ELL. During the conversion, the host removes some nonzero elements from the rows with exceedingly large numbers of nonzero elements and places these elements into a COO representation. The host then transfers the ELL representation to the device. When the device completes the SpMV/ELL kernel, it transfers the y values back to the host. These values are still missing the contributions from the elements in the COO representation, so the host performs a sequential SpMV/COO on the COO elements and completes their contributions to the y values.
The user may question whether the additional work done by the host to separate COO elements from an ELL format incurs too much overhead. The answer is: it depends. In situations where a sparse matrix is used in only one SpMV calculation, this extra work can indeed incur significant overhead. However, in many real-world applications, SpMV is performed repeatedly on the same sparse matrix in an iterative solver. In each iteration of the solver, the x and y vectors vary, but the sparse matrix remains the same, because its elements correspond to the coefficients of the linear system being solved and these coefficients do not change from iteration to iteration. So the work done to produce both the ELL and COO representations can be amortized across many iterations. We will come back to this point in the next section.
In our small example, the device finishes the SpMV/ELL kernel on the ELL portion of the data, and the y values are transferred back to the host. The host then adds the contribution of the COO element with the operation y[2] += data[0] * x[col_index[0]] = 1*x[3]. Note that in general there are multiple nonzero elements in the COO format, so we expect the host code to be a loop, as shown in Figure 10.14.
Figure 10.14 A sequential loop that implements SpMV/COO.
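The loop in Figure 10.14 amounts to the following sketch (function name and parameter order assumed):

```c
void spmv_coo(int num_elem, float *data, int *col_index,
              int *row_index, float *x, float *y) {
  for (int i = 0; i < num_elem; i++)  // process COO elements in storage order
    y[row_index[i]] += data[i] * x[col_index[i]];
}
```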
The loop is extremely simple: it iterates through all the data elements and performs a multiply-and-accumulate operation on the appropriate x and y elements using the accompanying col_index and row_index entries. We will not present a parallel SpMV/COO kernel; one can easily be constructed by having each thread process a portion of the data elements and using an atomic operation to accumulate the results into the y elements. This is because the threads are no longer mapped to particular rows. In fact, many rows will likely be absent from the COO representation; only the rows with exceedingly large numbers of nonzero elements contribute elements to it. It is therefore best to have each thread take a portion of the data elements and use an atomic operation to ensure that no thread tramples the contribution of another.
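One way to construct such a kernel, along the lines the text suggests, is sketched below; a grid-stride loop gives each thread a portion of the elements, and atomicAdd prevents threads from trampling each other's contributions. The kernel name and parameter list are assumptions:

```c
__global__ void spmv_coo_kernel(int num_elem, float *data, int *col_index,
                                int *row_index, float *x, float *y) {
  // Grid-stride loop: each thread handles a portion of the COO elements.
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num_elem;
       i += blockDim.x * gridDim.x)
    atomicAdd(&y[row_index[i]], data[i] * x[col_index[i]]);  // avoid races on y
}
```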
The hybrid SpMV/ELL-COO method is a good illustration of the productive use of both CPUs and GPUs in a heterogeneous computing system: the CPU can perform SpMV/COO quickly using its large cache memory, while the GPU can perform SpMV/ELL quickly using its coalesced memory accesses and large number of hardware execution units. The removal of some elements from the ELL format is a form of regularization: it reduces the disparity between long and short rows and makes the workload of all threads more uniform. Such improved uniformity brings benefits such as less control divergence in an SpMV/CSR kernel or less padding in an SpMV/ELL kernel.
While COO helps to regulate the amount of padding in an ELL representation, we can further reduce the padding overhead by sorting and partitioning the rows of a sparse matrix. The idea is to sort the rows according to their length, say from the longest to the shortest. This is illustrated with our small sparse matrix in Figure 10.15. Since the sorted matrix looks largely like a triangular matrix, the format is often referred to as jagged diagonal storage (JDS). As we sort the rows, we typically maintain an additional jds_row_index array that preserves the original index of the row. For CSR, this is similar to the row_ptr array in that there is one element per row. Whenever we exchange two rows in the sorting process, we also exchange the corresponding elements of the jds_row_index array. This way, we can always keep track of the original position of all rows.
Figure 10.15 Sorting rows according to their length.
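A sketch of one way to build the jds_row_index permutation from CSR row lengths; the function name is an assumption, and a simple insertion sort is used for clarity rather than speed:

```c
// Sort row indices by row length, longest first, recording the permutation.
void build_jds_row_index(int num_rows, const int *row_ptr, int *jds_row_index) {
  for (int r = 0; r < num_rows; r++) jds_row_index[r] = r;  // original order
  for (int i = 1; i < num_rows; i++) {                      // insertion sort
    int idx = jds_row_index[i];
    int len = row_ptr[idx + 1] - row_ptr[idx];
    int j = i - 1;
    while (j >= 0 &&
           row_ptr[jds_row_index[j] + 1] - row_ptr[jds_row_index[j]] < len) {
      jds_row_index[j + 1] = jds_row_index[j];  // shift shorter rows down
      j--;
    }
    jds_row_index[j + 1] = idx;
  }
}
```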
Once a sparse matrix is in JDS format, we can partition the matrix into sections of rows. Since the rows have been sorted, all rows in a section will likely have a more or less uniform number of nonzero elements. In Figure 10.15, we can divide the small matrix into three sections: the first section consists of the one row that has three elements, the second section consists of the two rows with two elements each, and the third section consists of one row without any element. We can then generate ELL representation for each section. Within each section, we only need to pad the rows to match the row with the maximal number of elements in that section. This would reduce the number of padded elements. In our example, we do not even need to pad within any of the three sections. We can then transpose each section independently and launch a separate kernel on each section. In fact, we do not even need to launch a kernel for the section of rows with no nonzero elements.
Figure 10.16 shows a JDS-ELL representation of our small sparse matrix, assuming the same sorting and partitioning results shown in Figure 10.15. Of the three sections, the first has only one row, so its transposed layout is the same as the original. The second section is a 2 × 2 matrix and has been transposed. The third section consists of row 1, which has no nonzero elements; this is reflected in the fact that its starting location and the next section's starting location are identical.
Figure 10.16 JDS format and sectioned ELL.
We will not show an SpMV/JDS kernel, because we would just be using either an SpMV/CSR kernel on each section of the CSR or an SpMV/ELL kernel on each section of the ELL after padding. The host code required to create a JDS representation and to launch SpMV kernels on each section of the representation is left as an exercise.
Note that we want each section to have a large number of rows so that its kernel launch will be worthwhile. In the extreme cases where a very small number of rows have an extremely large number of nonzero elements, we can still use the COO hybrid with JDS to allow us to have more rows in each section.
Once again, readers should ask whether sorting the rows will result in incorrect solutions to the linear system of equations. Recall that we can freely reorder the equations of a linear system without changing its solution. As long as we reorder the y elements along with the rows, we are effectively reordering the equations, so we will end up with a correct solution. The only extra step is to reorder the final solution back to the original order using the jds_row_index array.
Another question is whether sorting will incur significant overhead. The answer is similar to what we saw in the hybrid method: as long as the SpMV/JDS kernel is used in an iterative solver, one can afford to perform the sorting, as well as the reordering of the final solution x elements, and amortize the cost over many iterations of the solver.
In more recent devices, the memory coalescing hardware has relaxed the address alignment requirement. This allows one to simply transpose a JDS-CSR representation. Note that we do need to adjust the jds_section_ptr array after transposition. This further eliminates the need to pad rows in each section. As memory bandwidth becomes increasingly the limiting factor of performance, eliminating the need to store and fetch padded elements can be a significant advantage. Indeed, we have observed that while sectioned JDS-ELL tends to give the best performance on older CUDA devices, transposed JDS-CSR tends to give the best performance on Fermi and Kepler.
We would like to make an additional remark on the performance of sparse matrix computation compared to dense matrix computation. In general, the FLOPS rating achieved by either CPUs or GPUs is much lower for sparse matrix computation than for dense matrix computation. This is especially true for SpMV, where there is no data reuse in the sparse matrix: the CGMA value (see Chapter 5) is essentially 1, limiting the achievable FLOPS rate to a small fraction of the peak performance. The various formats are important for CPUs and GPUs alike, since both are limited by memory bandwidth when performing SpMV. Many people have been surprised by the low FLOPS rating of this type of computation on both CPUs and GPUs; after reading this chapter, one should no longer be.
In this chapter, we presented sparse matrix computation as an important parallel pattern. Sparse matrices are important in many real-world applications that involve modeling complex phenomena. Furthermore, sparse matrix computation is a simple example of the data-dependent performance behavior of many large real-world applications. Due to the large number of zero elements, compaction techniques are used to reduce the amount of storage, memory access, and computation performed on these zero elements. Unlike most of the other kernels presented in this book so far, the SpMV kernels are sensitive to the distribution of nonzero elements in the sparse matrix: not only can the performance of each kernel vary significantly across matrices, but their relative merits can also change significantly. Using this pattern, we introduced the concept of regularization via hybrid methods and sorting/partitioning; these regularization methods are used in many real-world applications. Interestingly, some of the regularization techniques reintroduce zero elements into the compacted representations, and we used hybrid methods to mitigate the pathological cases in which too many zero elements would be reintroduced. Readers are referred to [Bell2009] and encouraged to experiment with different sparse data sets to gain more insight into the data-dependent performance behavior of the various SpMV kernels presented in this chapter.
10.1. Complete the host code to produce the hybrid ELL-COO format, launch the ELL kernel on the device, and complete the contributions of the COO elements.
10.2. Complete the host code for producing JDS-ELL and launch one kernel for each section of the representation.
10.3. Consider the following sparse matrix:
1 0 7 0
0 0 8 0
0 4 3 0
2 0 0 1
Represent it in each of the following formats: (a) COO, (b) CSR, and (c) ELL.
10.4. Given a sparse matrix of integers with m rows, n columns, and z nonzeros, how many integers are needed to represent the matrix in (a) COO, (b) CSR, and (c) ELL? If the information provided is not enough, indicate what information is missing.
1. Hestenes, M., & Stiefel, E. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards. 1952;49.
2. Bell, N., & Garland, M. Implementing sparse matrix-vector multiplication on throughput-oriented processors, Proceedings of the ACM Conference on High-Performance Computing Networking Storage and Analysis (SC’09), 2009.
11.1 Application Background
11.2 Iterative Reconstruction
11.3 Computing F^H D
11.4 Final Evaluation
11.5 Exercises
Application case studies teach computational thinking and practical programming techniques in a concrete manner. They also demonstrate how individual techniques fit into a top-to-bottom development process. Most importantly, they help us visualize the practical use of these techniques in solving problems. In this chapter, we start with the background and problem formulation of a relatively simple application. We show that parallel execution not only speeds up the existing approaches, but also allows application experts to pursue approaches that are known to provide benefits but were previously ignored due to their excessive computational requirements. We then use an example algorithm and its implementation source code from such an approach to illustrate how a developer can systematically determine the kernel parallelism structure, assign variables to CUDA memories, steer around limitations of the hardware, validate results, and assess the impact of performance improvements.
Magnetic resonance imaging (MRI) is commonly used by the medical community to safely and noninvasively probe the structure and function of biological tissues in all regions of the body. Images that are generated using MRI have made profound impact in both clinical and research settings. MRI consists of two phases: acquisition (scan) and reconstruction. During the acquisition phase, the scanner samples data in the k-space domain (i.e., the spatial-frequency domain or Fourier transform domain) along a predefined trajectory. These samples are then transformed into the desired image during the reconstruction phase.
The application of MRI is often limited by high noise levels, significant imaging artifacts, and/or long data acquisition times. In clinical settings, short scan times not only increase scanner throughput but also reduce patient discomfort, which tends to mitigate motion-related artifacts. High image resolution and fidelity are important because they enable earlier detection of pathology, leading to improved prognoses for patients. However, the goals of short scan time, high resolution, and high signal-to-noise ratio (SNR) often conflict; improvements in one metric tend to come at the expense of one or both of the others. New technological breakthroughs are needed to improve on all three dimensions simultaneously. This case study presents an instance where massively parallel computing provides such a breakthrough.
Readers are referred to MRI textbooks such as Liang and Lauterbur [LL1999] for the physics principles behind MRI. For this case study, we will focus on the computational complexity in the reconstruction phase and how the complexity is affected by the k-space sampling trajectory. The k-space sampling trajectory used by the MRI scanner can significantly affect the quality of the reconstructed image, the time complexity of the reconstruction algorithm, and the time required for the scanner to acquire the samples. Equation (11.1) below shows a formulation that relates the k-space samples to the reconstructed image for a class of reconstruction methods.
m(r) = Σ_k W(k) s(k) e^(i 2π k·r)    (11.1)
In Eq. (11.1), m(r) is the reconstructed image, s(k) is the measured k-space data, and W(k) is a weighting function that accounts for nonuniform sampling. That is, W(k) decreases the influence of data from k-space regions where sample points are taken at a higher density. For this class of reconstructions, W(k) can also serve as an apodization function that reduces the influence of noise and reduces artifacts due to finite sampling.
If data is acquired at uniformly spaced Cartesian grid points in k-space under ideal conditions, the W(k) weighting function is a constant and can thus be factored out of the summation in Eq. (11.1). As a result, the reconstruction of m(r) becomes an inverse fast Fourier transform (FFT) on s(k), an extremely efficient computation. A collection of data measured at such uniformly spaced Cartesian grid points is referred to as a Cartesian scan trajectory, depicted in Figure 11.1(a). In practice, Cartesian scan trajectories allow straightforward implementation on scanners and are widely used in clinical settings today.
Figure 11.1 Scanner k-space trajectories and their associated reconstruction strategies: (a) Cartesian trajectory with FFT reconstruction, (b) spiral (or non-Cartesian trajectory in general) followed by gridding to enable FFT reconstruction, and (c) spiral (non-Cartesian) trajectory with linear solver–based reconstruction.
Although the inverse FFT reconstruction of Cartesian scan data is computationally efficient, non-Cartesian scan trajectories often offer advantages: reduced sensitivity to patient motion, better ability to provide self-calibrating field inhomogeneity information, and reduced requirements on scanner hardware performance. As a result, non-Cartesian scan trajectories such as spirals (shown in Figure 11.1c), radial lines (projection imaging), and rosettes have been proposed to reduce motion-related artifacts and address scanner hardware performance limitations. These improvements have recently allowed the reconstructed image pixel values to be used for measuring subtle phenomena, such as tissue chemical anomalies, before they become anatomical pathology. Figure 11.2 shows such a measurement, which generates a map of sodium, a heavily regulated substance in normal human tissues. This information can be used to track tissue health during stroke and cancer treatment; the variation or shifting of sodium concentration gives early signs of disease development or tissue death. For example, the sodium map of a human brain shown in Figure 11.2 can give an early indication of brain tumor tissue responsiveness to chemotherapy protocols, enabling individualized medicine. Because sodium is much less abundant than water molecules in human tissues, a reliable measure of sodium levels requires the higher SNR of a larger number of samples, and the extra scan time must be kept under control with non-Cartesian scan trajectories.
Figure 11.2 The use of a non-Cartesian k-space sample trajectory and accurate linear solver–based reconstruction enables new MRI modalities with exciting medical applications. The improved SNR enables reliable collection of in-vivo concentration data on a chemical substance such as sodium in human tissues. The variation or shifting of sodium concentration gives early signs of disease development or tissue death. For example, the sodium map of a human brain shown in this figure can give an early indication of brain tumor tissue responsiveness to chemotherapy protocols, enabling individualized medicine.
Image reconstruction from non-Cartesian trajectory data presents both challenges and opportunities. The main challenge arises from the fact that the exponential terms are no longer uniformly spaced; the summation does not have the form of an FFT anymore. Therefore, one can no longer perform reconstruction by directly applying an inverse FFT to the k-space samples. In a commonly used approach called gridding, the samples are first interpolated onto a uniform Cartesian grid and then reconstructed using the FFT (see Figure 11.1b). For example, a convolution approach to gridding takes a k-space data point, convolves it with a gridding kernel, and accumulates the results on a Cartesian grid. Convolution is quite computationally intensive. Accelerating gridding computation on many-core processors facilitates the application of the current FFT approach to non-Cartesian trajectory data. Since we have already studied the convolution pattern in Chapter 8 and will be examining a convolution-style computation in Chapter 12, we will not cover it here.
In this chapter, we will cover an iterative, statistically optimal image reconstruction method that can accurately model imaging physics and bound the noise error in each image pixel value. However, such iterative reconstructions have been impractical for large-scale 3D problems due to their excessive computational requirements compared to gridding. Recently, these reconstructions have become viable in clinical settings when accelerated on GPUs. In particular, we will show that an iterative reconstruction algorithm that used to take hours using a high-end sequential CPU now takes only minutes using both CPUs and GPUs for an image of moderate resolution, a delay acceptable in clinical settings.
Haldar et al. [HHB 2007] proposed a linear solver–based iterative reconstruction algorithm for non-Cartesian scan data, as shown in Figure 11.1(c). The algorithm allows explicit modeling of, and compensation for, the physics of the scanner data acquisition process, and can thus reduce artifacts in the reconstructed image. It is, however, computationally expensive: reconstruction times on high-end sequential CPUs have been hours for moderate-resolution images, which is impractical in clinical use. We use this as an example of innovative methods that have required too much computation time to be considered practical. We will show that massive parallelism can reduce the reconstruction time to the order of a minute, so that new MRI modalities such as sodium imaging can be deployed in clinical settings.
Figure 11.3 shows a solution of the quasi-Bayesian estimation problem formulation of the iterative linear solver–based reconstruction approach, where ρ is a vector containing the voxel values of the reconstructed image, F is a matrix that models the physics of the imaging process, D is a vector of data samples from the scanner, and W is a matrix that can incorporate prior information such as anatomical constraints. In clinical settings, the anatomical constraints represented in W are derived from one or more high-resolution, high-SNR water molecule scans of the patient. These water molecule scans reveal features such as the locations of anatomical structures, and the matrix W is derived from these reference images. The problem is to solve for ρ given all the other matrices and vectors.
Figure 11.3 An iterative linear solver–based approach to reconstruction of non-Cartesian k-space sample data.
On the surface, the computational solution to the problem formulation in Figure 11.3 should be very straightforward. It involves matrix–matrix multiplication and addition (F^H F + λW^H W), matrix–vector multiplication (F^H D), matrix inversion ((F^H F + λW^H W)^(−1)), and finally matrix–matrix multiplication ((F^H F + λW^H W)^(−1) × F^H D). However, the sizes of the matrices make this straightforward approach extremely time consuming. F^H and F are 3D matrices whose dimensions are determined by the resolution of the reconstructed image ρ. Even in a modest-resolution 128^3-voxel reconstruction, there are 128^3 columns in F, each with N elements, where N is the number of k-space samples used. Obviously, F is extremely large.
The sizes of the matrices involved are so large that the matrix operations required for a direct solution of the equation in Figure 11.3 are practically intractable. An iterative method for matrix inversion, such as the conjugate gradient (CG) algorithm, is therefore preferred. The CG algorithm reconstructs the image by iteratively solving the equation in Figure 11.3 for ρ. During each iteration, the CG algorithm updates the current image estimate ρ to improve the value of the quasi-Bayesian cost function. The computational efficiency of the CG technique is largely determined by the efficiency of the matrix–vector multiplication operations involving F^H F + λW^H W and ρ, as these operations are required during each iteration of the CG algorithm.
Fortunately, the matrix W often has a sparse structure that permits efficient multiplication by W^H W, and the matrix F^H F is Toeplitz, which enables efficient matrix–vector multiplication via the FFT. Stone et al. [SHT2008] present a GPU-accelerated method for calculating Q, a data structure that allows us to quickly compute matrix–vector multiplications involving F^H F without actually calculating F^H F itself. The calculation of Q can take days on a high-end CPU core, but it needs to be done only once for a given trajectory and can be reused for multiple scans.
The matrix–vector multiplication to calculate F^H D takes about one order of magnitude less time than Q, but can still take about three hours for a 128^3-voxel reconstruction on a high-end sequential CPU. Since F^H D needs to be computed for every image acquisition, it is desirable to reduce its computation time to minutes. We will show the details of this process. As it turns out, the core computational structure of Q is identical to that of F^H D, so the same methodology can be used to accelerate the computation of both.
The “find ρ” step in Figure 11.3 performs the actual CG based on F^H D. As we explained earlier, the precalculation of Q makes this step much less computationally intensive than F^H D; it accounts for less than 1% of the execution time of the reconstruction of each image on a sequential CPU. As a result, we leave it out of the parallelization scope and focus on F^H D in this chapter. We will, however, revisit its status at the end of the chapter.
Figure 11.4 shows a sequential C implementation of the core step of computing a data structure for multiplications with F^H F (referred to as the Q computation in Figure 11.4a) without explicitly calculating F^H F, and of the core step for F^H D (Figure 11.4b). A quick glance at Figures 11.4(a) and (b) shows that the core steps of Q and F^H D have identical structures: both start with an outer loop that encloses an inner loop. The only differences are the particular calculations done in each loop body and the fact that the core step of Q involves a much larger m, since it implements a matrix–matrix multiplication as opposed to a matrix–vector multiplication and therefore incurs a much longer execution time. It thus suffices to discuss one of them. We will focus on F^H D, since this is the computation that must be run for each image being reconstructed.
Figure 11.4 Computation of (a) Q and (b) FHD.
A quick glance at Figure 11.4(b) shows that the C implementation of FHD is an excellent candidate for acceleration on the GPU because it exhibits substantial data parallelism. The algorithm first computes the real and imaginary components of Mu (rMu and iMu) at each sample point in the k-space. It then computes the real and imaginary components of FHD at each voxel in the image space. The value of FHD at any voxel depends on the values of all k-space sample points. However, no voxel elements of FHD depend on any other elements of FHD. Therefore, all elements of FHD can be computed in parallel. Specifically, all iterations of the outer loop can be done in parallel and all iterations of the inner loop can be done in parallel. The calculations of the inner loop, however, have a dependence on the calculation done by the preceding statements in the same iteration of the outer loop.
Despite the algorithm’s abundant inherent parallelism, potential performance bottlenecks are evident. First, in the loop that computes the elements of FHD, the ratio of floating-point operations to memory accesses is at best 3:1 and at worst 1:1. The best case assumes that the sin and cos trigonometry operations are computed using the five-element Taylor series that requires 13 and 12 floating-point operations, respectively. The worst case assumes that each trigonometric operation is computed as a single operation in hardware. As we have seen in Chapter 5, a floating-point to memory access ratio of 16:1 or more is needed for the kernel to not be limited by memory bandwidth. Thus, the memory accesses will clearly limit the performance of the kernel unless the ratio is drastically increased.
Second, the ratio of floating-point arithmetic to floating-point trigonometry functions is only 13:2. Thus, a GPU-based implementation must tolerate or avoid stalls due to long-latency sin and cos operations. Without a good way to reduce the cost of trigonometry functions, the performance will likely be dominated by the time spent in these functions.
We are now ready to take the steps to convert FHD from sequential C code into a CUDA kernel.
The conversion of a loop into a CUDA kernel is conceptually straightforward. Since all iterations of the outer loop of Figure 11.4(b) can be executed in parallel, we can simply convert the outer loop into a CUDA kernel by mapping its iterations to CUDA threads. Figure 11.5 shows the kernel resulting from such a straightforward conversion. Each thread implements an iteration of the original outer loop; that is, we use each thread to calculate the contribution of one k-space sample to all FHD elements. The original outer loop has M iterations, and M can be in the millions, so we obviously need multiple thread blocks to generate enough threads to implement all these iterations.
Figure 11.5 First version of the FHD kernel. The kernel will not execute correctly due to conflicts between threads in writing into rFhD and iFhD arrays.
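The figure itself is not reproduced here; the following minimal sketch shows what such a straightforward conversion looks like, under the assumption that rPhi/iPhi and rD/iD are the names of the real and imaginary components of Φ and D (parameter names beyond those quoted in the text are illustrative):

#define FHD_THREADS_PER_BLOCK 512
#define PI 3.14159265f

// One thread per k-space sample m; each thread then loops over all N voxels.
// As the caption warns, this version is incorrect as written: every thread
// performs read-modify-write updates on every rFhD[n] and iFhD[n].
__global__ void cmpFHd(float* rPhi, float* iPhi, float* rD, float* iD,
                       float* kx, float* ky, float* kz,
                       float* x, float* y, float* z,
                       float* rMu, float* iMu,
                       float* rFhD, float* iFhD, int N) {
  int m = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;
  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
  for (int n = 0; n < N; n++) {
    float expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    float cArg = cosf(expFhD);
    float sArg = sinf(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;   // race across threads
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;   // race across threads
  }
}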
To make performance tuning easy, we declare a constant FHD_THREADS_PER_BLOCK that defines the number of threads in each thread block when we invoke the cmpFHd kernel. Thus, we will use M/FHD_THREADS_PER_BLOCK for the grid size (in terms of number of blocks) and FHD_THREADS_PER_BLOCK for the block size (in terms of number of threads) for kernel invocation. Within the kernel, each thread calculates the original iteration of the outer loop that it is assigned to cover using the formula blockIdx.x∗FHD_THREADS_PER_BLOCK + threadIdx.x. For example, assume that there are 65,536 k-space samples and we decide to use 512 threads per block. The grid size at kernel invocation would be 65,536÷512=128 blocks and the block size would be 512. The calculation of m for each thread would be equivalent to blockIdx.x∗512 + threadIdx.x.
While the kernel of Figure 11.5 exploits ample parallelism, it suffers from a major problem: all threads write into all rFhD and iFhD voxel elements. This means that the kernel must use atomic operations in the global memory in the inner loop to keep threads from trashing each other’s contributions to the voxel value. This can seriously affect the performance of the kernel. Note that as is, the code will not even execute correctly since no atomic operation is used. We need to explore other options.
The other option is to use each thread to calculate one FhD value from all k-space samples. To do so, we need to first swap the inner loop and the outer loop so that each of the new outer loop iterations processes one FhD element. That is, each of the new outer loop iterations will execute the new inner loop that accumulates the contribution of all k-space samples to the FhD element handled by the outer loop iteration. This transformation to the loop structure is called loop interchange. It requires a perfectly nested loop, meaning that there is no statement between the outer for loop statement and the inner for loop statement. This is, however, not true for the FHD code in Figure 11.4(b). We need to find a way to move the calculation of rMu and iMu elements out of the way.
From a quick inspection of Figure 11.6(a), which is a replicate of Figure 11.4(b), we see that the FHD calculation can be split into two separate loops, as shown in Figure 11.6(b), using a technique called loop fission or loop splitting. This transformation takes the body of a loop and splits it into two loops. In the case of FHD, the outer loop consists of two parts: the statements before the inner loop and the inner loop. As shown in Figure 11.6(b), we can perform loop fission on the outer loop by placing the statements before the inner loop into a loop and the inner loop into a second loop. The transformation changes the relative execution order of the two parts of the original outer loop. In the original outer loop, both parts of the first iteration execute before the second iteration. After fission, the first part of all iterations will execute; they are then followed by the second part of all iterations. Readers should be able to verify that this change of execution order does not affect the execution results for FHD. This is because the execution of the first part of each iteration does not depend on the result of the second part of any preceding iterations of the original outer loop. Loop fission is a transformation often done by advanced compilers that are capable of analyzing the (lack of) dependence between statements across loop iterations.
Figure 11.6 Loop fission on the FHD computation: (a) before loop fission and (b) after loop fission.
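Since Figure 11.6 is not reproduced here, the following sequential sketch illustrates the transformation (rPhi/iPhi and rD/iD are assumed names, as before):

/* (a) Before fission: both parts share one outer loop over m. */
for (int m = 0; m < M; m++) {
  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];               /* part 1 */
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
  for (int n = 0; n < N; n++) {                         /* part 2 */
    float expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    rFhD[n] += rMu[m]*cosf(expFhD) - iMu[m]*sinf(expFhD);
    iFhD[n] += iMu[m]*cosf(expFhD) + rMu[m]*sinf(expFhD);
  }
}

/* (b) After fission: part 1 finishes for all m before part 2 begins. */
for (int m = 0; m < M; m++) {
  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
}
for (int m = 0; m < M; m++) {
  for (int n = 0; n < N; n++) {
    float expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    rFhD[n] += rMu[m]*cosf(expFhD) - iMu[m]*sinf(expFhD);
    iFhD[n] += iMu[m]*cosf(expFhD) + rMu[m]*sinf(expFhD);
  }
}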
With loop fission, the FHD computation is now done in two steps. The first step is a single-level loop that calculates the rMu and iMu elements for use in the second loop. The second step corresponds to the loop that calculates the FhD elements based on the rMu and iMu elements calculated in the first step. Each step can now be converted into a CUDA kernel. The two CUDA kernels will execute sequentially with respect to each other. Since the second loop needs to use the results from the first loop, separating these two loops into two kernels that execute in sequence does not sacrifice any parallelism.
The cmpMu() kernel in Figure 11.7 implements the first loop. The conversion of the first loop from sequential C code to a CUDA kernel is straightforward: each thread implements one iteration of the original C code. Since the M value can be very big, reflecting the large number of k-space samples, such a mapping can result in a large number of threads; with 512 threads in each block, we will need multiple blocks to accommodate them. This can be accomplished by having a number of threads in each block, specified by MU_THREADS_PER_BLOCK, and by employing the M/MU_THREADS_PER_BLOCK blocks needed to cover all M iterations of the original loop. For example, if there are 65,536 k-space samples, the kernel could be invoked with a configuration of 512 threads per block and 65,536÷512=128 blocks. This is done by assigning 512 to MU_THREADS_PER_BLOCK and using MU_THREADS_PER_BLOCK as the block size and M/MU_THREADS_PER_BLOCK as the grid size during kernel invocation.
Figure 11.7 cmpMu kernel.
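A minimal sketch of such a kernel, again with assumed parameter names, might look as follows; note that each thread writes only its own rMu[m] and iMu[m]:

#define MU_THREADS_PER_BLOCK 512

__global__ void cmpMu(float* rPhi, float* iPhi, float* rD, float* iD,
                      float* rMu, float* iMu) {
  int m = blockIdx.x * MU_THREADS_PER_BLOCK + threadIdx.x;
  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
}

A corresponding launch would be cmpMu<<<M/MU_THREADS_PER_BLOCK, MU_THREADS_PER_BLOCK>>>(rPhi_d, iPhi_d, rD_d, iD_d, rMu_d, iMu_d), where the _d suffix marking device pointers is an assumed naming convention.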
Within the kernel, each thread can identify the iteration assigned to it using its blockIdx and threadIdx values. Since the threading structure is one dimensional, only blockIdx.x and threadIdx.x need to be used. Because each block covers a section of the original iterations, the iteration covered by a thread is blockIdx.x∗MU_THREADS_PER_BLOCK + threadIdx.x. For example, assume that MU_THREADS_PER_BLOCK=512. The thread with blockIdx.x=0 and threadIdx.x=37 covers the 37th iteration of the original loop, whereas the thread with blockIdx.x=5 and threadIdx.x=2 covers the 2,562nd (5×512+2) iteration. Using this iteration number to access the Mu, Phi, and D arrays ensures that the arrays are covered by the threads in the same way they were covered by the iterations of the original loop. Because every thread writes into its own Mu element, there is no potential conflict between any of these threads.
Determining the structure of the second kernel requires a little more work. An inspection of the second loop in Figure 11.6(b) shows that there are at least three options for designing it. In the first option, each thread corresponds to one iteration of the inner loop. This option creates the largest number of threads and thus exploits the largest amount of parallelism. However, the number of threads would be N×M, where N is in the range of millions and M in the range of hundreds of thousands; their product would result in far too many threads in the grid.
A second option is to use each thread to implement an iteration of the outer loop. This option employs fewer threads than the first option: instead of generating N×M threads, it generates M threads. Since M corresponds to the number of k-space samples and a large number of samples (on the order of a hundred thousand) are typically used to calculate FHD, this option still exploits a large amount of parallelism. However, this kernel suffers from the same problem as the kernel in Figure 11.5: each thread writes into all rFhD and iFhD elements, creating an extremely large number of conflicts between threads. As in Figure 11.5, the code in Figure 11.8 will not execute correctly without adding atomic operations, which would significantly slow down the execution. Thus, this option does not work well.
Figure 11.8 Second option of the FHD kernel.
A third option is to use each thread to compute one pair of rFhD and iFhD elements. This option requires us to interchange the inner and outer loops and then use each thread to implement an iteration of the new outer loop. The transformation is shown in Figure 11.9. Loop interchange is necessary because the loop being implemented by the CUDA threads must be the outer loop; interchanging makes each of the new outer loop iterations process a pair of rFhD and iFhD elements. Loop interchange is permissible here because all iterations of both levels of loops are independent of each other: they can be executed in any order relative to one another, and loop interchange, which changes the order of the iterations, is allowed whenever that is the case. This option generates N threads. Since N corresponds to the number of voxels in the reconstructed image, the N value can be very large for higher-resolution images. For a 128³ image there are 128³=2,097,152 threads, resulting in a large amount of parallelism. For higher resolutions, such as 512³, we may need to invoke multiple kernels, with each kernel generating the values for a subset of the voxels. Note that these threads all accumulate into their own rFhD and iFhD elements, since every thread has a unique n value. There is no conflict between threads; they can run totally in parallel. This makes the third option the best choice among the three.
Figure 11.9 Loop interchange of the FHD computation.
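A sequential sketch of the interchanged loop nest (Figure 11.9 is not reproduced here) makes the new structure explicit:

/* After loop interchange: the n (voxel) loop is outermost, so a CUDA
   thread can later own one rFhD[n]/iFhD[n] pair. */
for (int n = 0; n < N; n++) {
  for (int m = 0; m < M; m++) {
    float expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    rFhD[n] += rMu[m]*cosf(expFhD) - iMu[m]*sinf(expFhD);
    iFhD[n] += iMu[m]*cosf(expFhD) + rMu[m]*sinf(expFhD);
  }
}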
The kernel derived from the interchanged loops is shown in Figure 11.10. The threads are organized as a two-level structure: the outer loop has been stripped away, and each thread covers an iteration of the outer (n) loop, where n is equal to blockIdx.x∗FHD_THREADS_PER_BLOCK + threadIdx.x. Once this iteration (n) value is identified, the thread executes the inner loop based on that n value. This kernel can be invoked with a number of threads in each block, specified by a global constant FHD_THREADS_PER_BLOCK. Assuming that N is the variable that gives the number of voxels in the reconstructed image, N/FHD_THREADS_PER_BLOCK blocks cover all N iterations of the original loop. For example, if there are 65,536 voxels, the kernel could be invoked with a configuration of 512 threads per block and 65,536÷512=128 blocks. This is done by assigning 512 to FHD_THREADS_PER_BLOCK and using FHD_THREADS_PER_BLOCK as the block size and N/FHD_THREADS_PER_BLOCK as the grid size during kernel invocation.
Figure 11.10 Third option of the FHD kernel.
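A minimal sketch of this third-option kernel (assumed names, as before):

// One thread per voxel n; the m loop over all k-space samples runs inside
// the thread, so each thread accumulates only into its own output elements.
__global__ void cmpFHd(float* kx, float* ky, float* kz,
                       float* x, float* y, float* z,
                       float* rMu, float* iMu,
                       float* rFhD, float* iFhD, int M) {
  int n = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;
  for (int m = 0; m < M; m++) {
    float expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    float cArg = cosf(expFhD);
    float sArg = sinf(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;   // no conflicts: one n per thread
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}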
The simple cmpFhD kernel in Figure 11.10 will provide limited speedup due to memory bandwidth limitations. A quick analysis shows that the execution is limited by the low compute to memory access ratio of each thread. In the original loop, each iteration performs at least 14 memory accesses: kx[m], ky[m], kz[m], x[n], y[n], z[n], rMu[m] twice, iMu[m] twice, rFhD[n] read and write, and iFhD[n] read and write. Meanwhile, about 13 floating-point multiply, add, or trigonometry operations are performed in each iteration. Therefore, the compute to memory access ratio is approximately 1, which is too low according to our analysis in Chapter 5.
We can immediately improve the compute to memory access ratio by assigning some of the array elements to automatic variables. As we discussed in Chapter 5, automatic variables will reside in registers, thus converting reads and writes to the global memory into reads and writes to on-chip registers. A quick review of the kernel in Figure 11.10 shows that for each thread, the same x[n], y[n], and z[n] elements are used across all iterations of the for loop. This means that we can load these elements into automatic variables before execution enters the loop.2 The kernel can then use the automatic variables inside the loop, converting those global memory accesses into register accesses. Furthermore, the loop repeatedly reads from and writes into rFhD[n] and iFhD[n]. We can have the iterations read from and write into two automatic variables, and write the contents of these variables into rFhD[n] and iFhD[n] only after execution exits the loop. The resulting code is shown in Figure 11.11. By increasing the number of registers used by each thread by 5, we have reduced the number of memory accesses per iteration from 14 to 7, raising the compute to memory access ratio from 13:14 to 13:7. This is a very good improvement and a good use of the precious register resource.
Figure 11.11 Using registers to reduce memory accesses in the FHD kernel.
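A sketch of the register-optimized version described above (assumed names, as before):

__global__ void cmpFHd(float* kx, float* ky, float* kz,
                       float* x, float* y, float* z,
                       float* rMu, float* iMu,
                       float* rFhD, float* iFhD, int M) {
  int n = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;
  float xn = x[n], yn = y[n], zn = z[n];    // loaded once, kept in registers
  float rFhDn = rFhD[n], iFhDn = iFhD[n];   // accumulators live in registers
  for (int m = 0; m < M; m++) {
    float expFhD = 2*PI*(kx[m]*xn + ky[m]*yn + kz[m]*zn);
    float cArg = cosf(expFhD);
    float sArg = sinf(expFhD);
    rFhDn += rMu[m]*cArg - iMu[m]*sArg;     // 7 global accesses per iteration
    iFhDn += iMu[m]*cArg + rMu[m]*sArg;
  }
  rFhD[n] = rFhDn;                          // written back once per thread
  iFhD[n] = iFhDn;
}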
Recall that register usage can limit the number of blocks that can run in a streaming multiprocessor (SM). By increasing the register usage by 5 in the kernel code, we increase the register usage of each thread block by 5∗FHD_THREADS_PER_BLOCK. Assuming that we have 128 threads per block, we just increased the block register usage by 640. Since each SM can accommodate a combined usage of 65,536 registers among all blocks assigned to it (at least for devices of compute capability 3.5), we need to be careful: any further increase in register usage could begin to limit the number of blocks that can be assigned to an SM. Fortunately, register usage is not a limiting factor to parallelism for this kernel.
We want to further improve the compute to memory access ratio, to something closer to 10:1, by eliminating more global memory accesses in the cmpFHd kernel. The next candidates to consider are the k-space samples kx[m], ky[m], and kz[m]. These array elements are accessed differently than the x[n], y[n], and z[n] elements: a different element of kx, ky, and kz is accessed in each iteration of the loop in Figure 11.11. This means that we cannot load each k-space element into an automatic variable and access it from a register across all iterations, so registers will not help here. However, we should notice that the k-space elements are not modified by the kernel. This means that we might be able to place the k-space elements into the constant memory, where the constant cache can perhaps eliminate most of the memory accesses.
An analysis of the loop in Figure 11.11 reveals that the k-space elements are indeed excellent candidates for constant memory. The index used for accessing kx, ky, and kz is m. m is independent of threadIdx, which implies that all threads in a warp will be accessing the same element of kx, ky, and kz. This is an ideal access pattern for cached constant memory: every time an element is brought into the cache, it will be used at least by all 32 threads in a warp for a current generation device. This means that for every 32 accesses to the constant memory, at least 31 of them will be served by the cache. This allows the cache to effectively eliminate 96% or more of the accesses to the constant memory. Better yet, each time when a constant is accessed from the cache, it can be broadcast to all the threads in a warp. This means that no delays are incurred due to any bank conflicts in the access to the cache. This makes constant memory almost as efficient as registers for accessing k-space elements.3
There is, however, a technical issue involved in placing the k-space elements into the constant memory. Recall that constant memory has a capacity of 64 KB, while the k-space sample set can be much larger, on the order of hundreds of thousands or even millions of elements. A typical way of working around this capacity limitation is to break the large data set down into chunks of 64 KB or smaller. The developer must reorganize the kernel so that it is invoked multiple times, with each invocation consuming only a chunk of the large data set. This turns out to be quite easy for the cmpFHd kernel.
A careful examination of the loop in Figure 11.11 reveals that all threads will sequentially march through the k-space arrays. That is, all threads in the grid access the same k-space element during each iteration. For large data sets, the loop in the kernel simply iterates more times. This means that we can divide up the loop into sections, with each section processing a chunk of the k-space elements that fit into the 64 KB capacity of the constant memory.4 The host code now invokes the kernel multiple times. Each time the host invokes the kernel, it places a new chunk into the constant memory before calling the kernel function. This is illustrated in Figure 11.12. (For more recent devices and CUDA versions, a const __restrict__ declaration of kernel parameters makes the corresponding input data available in the “read-only data” cache, which is a simpler way of getting the same effect as using constant memory.)
Figure 11.12 Chunking k-space data to fit into constant memory.
In Figure 11.12, the cmpFHd kernel is called from a loop. The code assumes that the kx, ky, and kz arrays are in the host memory and that their dimensions are given by M. At each iteration, the host code transfers a chunk of the k-space data into the device constant memory, using cudaMemcpyToSymbol(), the API function for copying to __constant__ variables, and then invokes the kernel to process the chunk. Note that when M is not a perfect multiple of CHUNK_SIZE, the host code will need an additional round of transfer and one more kernel invocation to finish the remaining k-space data.
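A minimal sketch of the host-side chunking loop (CHUNK_SIZE's value and the _d device-pointer names are assumptions; the leftover-chunk handling mentioned above is omitted):

#define CHUNK_SIZE 4096   // illustrative; 3 float arrays must fit in 64 KB

__constant__ float kx_c[CHUNK_SIZE], ky_c[CHUNK_SIZE], kz_c[CHUNK_SIZE];

for (int i = 0; i < M/CHUNK_SIZE; i++) {
  // Copy one chunk of each coordinate array into constant memory.
  cudaMemcpyToSymbol(kx_c, &kx[i*CHUNK_SIZE], CHUNK_SIZE*sizeof(float));
  cudaMemcpyToSymbol(ky_c, &ky[i*CHUNK_SIZE], CHUNK_SIZE*sizeof(float));
  cudaMemcpyToSymbol(kz_c, &kz[i*CHUNK_SIZE], CHUNK_SIZE*sizeof(float));
  // Each launch processes CHUNK_SIZE samples; rMu/iMu advance with the chunk.
  cmpFHd<<<N/FHD_THREADS_PER_BLOCK, FHD_THREADS_PER_BLOCK>>>(
      x_d, y_d, z_d, &rMu_d[i*CHUNK_SIZE], &iMu_d[i*CHUNK_SIZE],
      rFhD_d, iFhD_d, CHUNK_SIZE);
}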
Figure 11.13 shows the revised kernel that accesses the k-space data from constant memory. Note that the pointers to kx, ky, and kz are no longer in the parameter list of the kernel function. Since we cannot use kernel parameters to access variables in the constant memory, the kx_c, ky_c, and kz_c arrays are accessed as global variables declared under the __constant__ keyword, as shown in Figure 11.12. By accessing these elements from the constant cache, the kernel now has effectively only four global memory accesses, all to the rMu and iMu arrays. The compiler will typically recognize that these four array accesses are made to only two locations: it will perform just two global accesses, one to rMu[m] and one to iMu[m], and keep the values in temporary register variables for the other two uses. This makes the final number of memory accesses per iteration two, raising the compute to memory access ratio to 13:2. This is still not quite the desired 10:1 ratio, but it is sufficiently high that memory bandwidth is no longer the only factor limiting performance. As we will see, we can perform a few other optimizations that make the computation more efficient and further improve performance.
Figure 11.13 Revised FHD kernel to use constant memory.
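A sketch of the constant-memory version of the kernel (names assumed, as before); the k-space coordinates now come from the __constant__ arrays declared above rather than from pointer parameters, and M is the size of the current chunk:

__global__ void cmpFHd(float* x, float* y, float* z,
                       float* rMu, float* iMu,
                       float* rFhD, float* iFhD, int M) {
  int n = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;
  float xn = x[n], yn = y[n], zn = z[n];
  float rFhDn = rFhD[n], iFhDn = iFhD[n];
  for (int m = 0; m < M; m++) {
    // kx_c/ky_c/kz_c reads are served by the constant cache.
    float expFhD = 2*PI*(kx_c[m]*xn + ky_c[m]*yn + kz_c[m]*zn);
    float cArg = cosf(expFhD);
    float sArg = sinf(expFhD);
    rFhDn += rMu[m]*cArg - iMu[m]*sArg;   // the only global reads left
    iFhDn += iMu[m]*cArg + rMu[m]*sArg;
  }
  rFhD[n] = rFhDn;
  iFhD[n] = iFhDn;
}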
If we ran the code in Figures 11.12 and 11.13, we would find that the performance enhancement is not as high as we expected for some devices. As it turns out, the code shown in these figures does not reduce memory bandwidth consumption as much as we expected. The reason is that the constant cache does not perform very well for this code, which has to do with the design of the constant cache and the memory layout of the k-space data. As shown in Figure 11.14, each constant cache entry is designed to store multiple consecutive words. This design reduces the cost of the constant cache hardware. If the data elements used together by each thread are not in consecutive words, as illustrated in Figure 11.14(a), they end up occupying multiple cache entries. Due to cost constraints, the constant cache has only a very small number of entries. As shown in Figures 11.12 and 11.13, the k-space data is stored in three arrays: kx_c, ky_c, and kz_c. During each iteration of the loop, three entries of the constant cache are needed to hold the three k-space elements being processed. Since different warps can be at very different iterations, they may require many entries altogether. As it turns out, the constant cache capacity in some devices may not be sufficient to provide enough entries for all the warps active in an SM.
Figure 11.14 Effect of k-space data layout on constant cache efficiency: (a) k-space data stored in separate arrays, and (b) k-space data stored in an array whose elements are structs.
The problem of inefficient use of cache entries has been well studied in the literature and can be solved by adjusting the memory layout of the k-space data. The solution is illustrated in Figure 11.14(b), and the code based on it is shown in Figure 11.15. Rather than storing the x, y, and z components of the k-space data in three separate arrays, the solution stores these components in a single array whose elements are structs; in the literature, this style of declaration is often referred to as an array of structs. The declaration of the array is shown at the top of Figure 11.15. By storing the x, y, and z components in the three fields of an array element, the developer forces these components to be stored in consecutive locations of the constant memory. Therefore, all three components used by an iteration can now fit into one cache entry, reducing the number of entries needed to support the execution of all the active warps. Note that since we now have only one array to hold all the k-space data, we can use a single cudaMemcpyToSymbol to copy an entire chunk to the device constant memory. The size of the transfer is adjusted from 4∗CHUNK_SIZE to 12∗CHUNK_SIZE to reflect the transfer of all three components in one copy.
Figure 11.15 Adjusting k-space data layout to improve cache efficiency.
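A sketch of the layout change; the struct and field names, the host array k, and the chunk index i are assumptions consistent with the kernel access pattern described below:

struct kdata {
  float x, y, z;                  // 12 bytes per k-space sample
};
__constant__ struct kdata k_c[CHUNK_SIZE];

// One copy per chunk now moves all three components together:
cudaMemcpyToSymbol(k_c, &k[i*CHUNK_SIZE], 12*CHUNK_SIZE);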
With the new data structure layout, we also need to revise the kernel so that the access is done according to the new layout. The new kernel is shown in Figure 11.16. Note that kx[m] has become k[m].x, ky[m] has become k[m].y, and so on. As we will see later, this small change to the code can result in significant enhancement of its execution speed.
Figure 11.16 Adjusting the k-space data memory layout in the FHD kernel.
CUDA offers hardware implementations of mathematical functions that provide much higher throughput than their software counterparts. These functions are implemented as hardware instructions executed by the SFUs (special function units). The procedure for using them is quite easy: in the case of the cmpFHd kernel, all we need to do is change the calls to the sin() and cos() functions into their hardware intrinsic versions, __sinf() and __cosf() (the single-precision intrinsics). These are intrinsic functions that the compiler recognizes and translates into SFU instructions. Because these functions are called in a heavily executed loop body, we expect the change to result in a very significant performance improvement. The resulting cmpFHd kernel is shown in Figure 11.17.
Figure 11.17 Using the hardware __sinf() and __cosf() functions.
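Only the trigonometry calls in the loop body of the previous sketch change:

for (int m = 0; m < M; m++) {
  float expFhD = 2*PI*(k_c[m].x*xn + k_c[m].y*yn + k_c[m].z*zn);
  float cArg = __cosf(expFhD);    // SFU hardware instruction
  float sArg = __sinf(expFhD);    // SFU hardware instruction
  rFhDn += rMu[m]*cArg - iMu[m]*sArg;
  iFhDn += iMu[m]*cArg + rMu[m]*sArg;
}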
However, we need to be careful about the reduced accuracy when switching from software functions to hardware functions. As we discussed in Chapter 7, the hardware implementations currently have slightly less accuracy than the software libraries (the details are available in the CUDA Programming Guide). In the case of MRI, we need to make sure that the hardware implementations provide enough accuracy, as shown in Figure 11.18. The testing process involves a “perfect” image (I0): a reverse process is used to synthesize corresponding “scanned” k-space data, which is then processed by the proposed reconstruction system to generate a reconstructed image (I). The voxel values of the perfect and reconstructed images are then fed into the peak signal-to-noise ratio (PSNR) formula shown in Figure 11.18.
Figure 11.18 Metrics used to validate the accuracy of hardware functions. I0 is the perfect image, I is the reconstructed image, and PSNR is the peak signal-to-noise ratio.
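Figure 11.18 itself is not reproduced here; for reference, a standard form of the metric it refers to, written in our own notation over the V voxels, is

\mathrm{MSE} = \frac{1}{V}\sum_{n=1}^{V}\bigl(I(n) - I_0(n)\bigr)^2, \qquad \mathrm{PSNR} = 20\log_{10}\frac{\max_n \lvert I_0(n)\rvert}{\sqrt{\mathrm{MSE}}}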
The criteria for passing the test depend on the application the image is intended for. In our case, we worked with experts in clinical MRI to ensure that the PSNR changes due to hardware functions are well within the accepted limits for their applications. In applications where physicians use the images to form an impression of injury or evaluate a disease, a visual inspection of the image quality is also needed. Figure 11.19 shows a visual comparison against the original “true” image. It also shows that the PSNR achieved by the CPU double-precision and single-precision implementations is 27.6 dB for both, an acceptable level for the application. A visual inspection confirms that the reconstructed image indeed corresponds well with the original image.
Figure 11.19 Validation of floating-point precision and accuracy of the different FHD implementations.
The advantage of iterative reconstruction compared to a simple bilinear interpolation gridding/iFFT is also obvious in Figure 11.19. The image reconstructed with the simple gridding/iFFT has a PSNR of only 16.8 dB, substantially lower than the PSNR of 27.6 dB achieved by the iterative reconstruction method. A visual inspection of the gridding/iFFT image in Figure 11.19(2) shows that there are severe artifacts that can significantly impact the usability of the image for diagnostic purposes. These artifacts do not occur in the images from the iterative reconstruction method.
When we moved from double-precision to single-precision arithmetic on the CPU, there was no measurable degradation of PSNR, which remained at 27.6 dB. When we moved the trigonometry functions from the software library to the hardware units, we observed a negligible degradation of PSNR, from 27.6 dB to 27.5 dB. This slight loss is within an acceptable range for the application, and a visual inspection confirms that the reconstructed image does not have significant artifacts compared to the original image.
Up to this point, we have not determined appropriate values for the kernel's configuration parameters. For example, we need to determine the optimal number of threads per block. On one hand, a large number of threads in a block is needed to fully utilize the thread capacity of each SM (given that at most 16 blocks can be assigned to each SM). On the other hand, more threads in each block increase the register usage of each block and can reduce the number of blocks that fit into an SM. Some possible values for the number of threads per block are 32, 64, 128, 256, and 512; one could also consider numbers that are not powers of two.
Another kernel configuration parameter is the number of times one should unroll the body of the for loop. This can be set using a #pragma unroll followed by the number of unrolls we want the compiler to perform on a loop. On one hand, unrolling the loop can reduce the number of overhead instructions, and potentially reduce the number of clock cycles to process each k-space sample data. On the other hand, too much unrolling can potentially increase the usage of registers and reduce the number of blocks that can fit into an SM.
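An illustrative use of the pragma on the sample loop of the previous sketches (the factor of 4 is an arbitrary example, not a recommendation):

// Ask the compiler to unroll the sample loop by a factor of 4; the best
// factor must be found experimentally, as discussed below.
#pragma unroll 4
for (int m = 0; m < M; m++) {
  float expFhD = 2*PI*(k_c[m].x*xn + k_c[m].y*yn + k_c[m].z*zn);
  rFhDn += rMu[m]*__cosf(expFhD) - iMu[m]*__sinf(expFhD);
  iFhDn += iMu[m]*__cosf(expFhD) + rMu[m]*__sinf(expFhD);
}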
Note that the effects of these configuration parameters are not isolated from each other: increasing one parameter value can consume resources that could otherwise be used to increase another. As a result, the parameters need to be evaluated jointly in an experimental manner. That is, one may need to change the source code for each joint configuration and measure the runtime, and there can be a large number of source code versions to try. In the case of FHD, systematically searching all the combinations and choosing the one with the best measured runtime improves performance by about 20% compared to a heuristic tuning search that only explores some promising trends. Ryoo et al. present a Pareto-optimal-curve-based method to screen away most of the inferior combinations [RRS2008].
To obtain a reasonable baseline, we implemented two versions of FHD on the CPU. Version CPU.DP uses double precision for all floating-point values and operations, while version CPU.SP uses single precision. Both CPU versions are compiled with Intel’s ICC (version 10.1) using flags -O3 -msse3 -axT -vec-report3 -fp-model fast=2, which (1) vectorizes the algorithm’s dominant loops using instructions tuned for the Core 2 architecture, and (2) links the trigonometric operations to fast, approximate functions in the math library. Based on experimental tuning with a smaller data set, the inner loops are unrolled by a factor of four and the scan data is tiled to improve locality in the L1 cache.
Each GPU version of FHD is compiled using NVCC -O3 (CUDA version 1.1) and executed on a 1.35 GHz Quadro FX5600. The Quadro card is housed in a system with a 2.4 GHz dual-socket, dual-core Opteron 2216 CPU; each core has a 1 MB L2 cache. The CPU versions use p-threads to execute on all four cores of a 2.66 GHz Core 2 Extreme quad-core CPU, which has a peak theoretical capacity of 21.2 GFLOPS per core and a 4 MB L2 cache. The CPU versions perform substantially better on the Core 2 Extreme quad-core than on the dual-socket, dual-core Opteron; therefore, we will use the Core 2 Extreme quad-core results for the CPU.
All reconstructions use the CPU version of the linear solver, which executes 60 iterations on the Quadro FX5600. Two versions of Q were computed on the Core 2 Extreme, one using double precision and the other using single precision. The single-precision Q was used for all GPU-based reconstructions and for the reconstruction involving CPU.SP, while the double-precision Q was used only for the reconstruction involving CPU.DP. As the computation of Q is not on the reconstruction’s critical path, we give Q no further consideration.
To facilitate comparison of the iterative reconstruction with a conventional reconstruction, we also evaluated a reconstruction based on bilinear interpolation gridding and inverse FFT. Our version of the gridded reconstruction is not optimized for performance, but it is already quite fast.
All reconstructions are performed on sample data obtained from a simulated, 3D, non-Cartesian scan of a phantom image. There are 284,592 sample points in the scan data set, and the image is reconstructed at 128³ resolution, for a total of 2²¹ voxels. In the first set of experiments, the simulated data contains no noise; in the second set, we added complex white Gaussian noise to the simulated data. When determining the quality of the reconstructed images, the percent error and PSNR metrics are used. The percent error is the root-mean-square (RMS) of the voxel error divided by the RMS voxel value in the true image (after the true image has been sampled at 128³ resolution).
The data (runtime, GFLOPS, and images) was obtained by reconstructing each image once with each of the implementations of the FHD algorithm described before. There are two exceptions to this policy. For GPU.Tune and GPU.Multi, the time required to compute FHD is so small that runtime variations in performance became non-negligible. Therefore, for these configurations we computed FHD three times and reported the average performance.
As shown in Figure 11.20, the total reconstruction time for the test image using bilinear interpolation gridding followed by inverse FFT takes less than one minute on a high-end sequential CPU. This confirms that there is little value in parallelizing this traditional reconstruction strategy. It is, however, obvious from Figure 11.19(2) that the resulting image exhibits an unacceptable level of artifacts.
Figure 11.20 Summary of performance improvements.
The LS (CPU, DP) row shows the execution timing of reconstructing the test image using double-precision floating-point arithmetic on the CPU. The timing includes the core step (Q) of calculating FHF+λWHW. The first observation is that the Q computation for a moderate-resolution image based on a moderate-size data sample takes an unacceptable amount of time (more than 65 hours) on the CPU for setting up the system for a patient. Note that this time is eventually reduced to 6.5 minutes on the GPU with all the optimizations described in Section 11.3. The second observation is that the total reconstruction time of each image requires more than 8 hours, of which only 1.59 minutes are spent in the linear solver. This validates our decision to focus our parallelization effort on FHD.
The LS (CPU, SP) row shows that we can reduce the execution time significantly by converting the computation from double-precision to single-precision floating-point arithmetic on the CPU. This is because the streaming SIMD extensions (SSE) instructions have higher throughput in single-precision mode; that is, they calculate more data elements per clock cycle. The execution times, however, are still unacceptable for practical use.
The LS (GPU, Naïve) row shows that a straightforward CUDA implementation can achieve a speedup of about 10× for Q and 8× for the reconstruction of each image. This is a good speedup, but the resulting execution times are still unacceptable for practical use.
The LS (GPU, CMem) row shows that significant further speedup is achieved by using registers and constant cache to get around the global memory bandwidth limitations. These enhancements achieve about 4× speedup over the naïve CUDA code! This shows the importance of achieving optimal compute to memory ratios in CUDA kernels. These enhancements bring the CUDA code to about 40× speedup over the single-precision CPU code.
The LS (GPU, CMem, SFU, Exp) row shows that using hardware trigonometry functions together with experimental tuning results in a dramatic speedup. A separate experiment, not shown in the figure, indicates that most of the speedup comes from the hardware trigonometry functions. The total speedup over the CPU single-precision code is very impressive: 357× for Q and 108× for the reconstruction of each image.
An interesting observation is that in the end, the linear solver actually takes more time than FHD. This is because we have accelerated FHD dramatically (228×). What used to be close to 100% of the per-image reconstruction time now accounts for less than 50%. Any further acceleration will now require acceleration of the linear solver, a much more difficult type of computation for massively parallel execution.
11.1. Loop fission splits a loop into two loops. Use the FHD code in Figure 11.4(b) and enumerate the execution order of the two parts of the outer loop body: (1) the statements before the inner loop and (2) the inner loop.
(a) List the execution order of these parts from different iterations of the outer loop before fission.
(b) List the execution order of these parts from the two loops after fission. Determine if the execution results will be identical. The execution results are identical if all data required by a part is properly generated and preserved for its consumption before that part executes, and the execution result of the part is not overwritten by other parts that should come after the part in the original execution order.
11.2. Loop interchange swaps the inner loop with the outer loop and vice versa. Use the loops from Figure 11.9 and enumerate the execution order of the instances of the loop body before and after the loop interchange.
(a) List the execution order of the loop body from different iterations before the loop interchange. Identify these iterations with the values of m and n.
(b) List the execution order of the loop body from different iterations after the loop interchange. Identify these iterations with the values of m and n.
(c) Determine if the (a) and (b) execution results will be identical. The execution results are identical if all data required by a part is properly generated and preserved for its consumption before that part executes and the execution result of the part is not overwritten by other parts that should come after the part in the original execution order.
11.3. In Figure 11.11, identify the difference between the access to x[] and kx[] in the nature of indices used. Use the difference to explain why it does not make sense to try to load kx[n] into a register for the kernel shown in Figure 11.11.
11.4. During a meeting, a new graduate student told his advisor that he improved his kernel performance by using cudaMalloc() to allocate constant memory and by using cudaMemcpy() to transfer read-only data from the CPU memory to the constant memory. If you were his advisor, what would be your response?
1. Liang Z-P, Lauterbur P. Principles of Magnetic Resonance Imaging: A Signal Processing Perspective. New York: John Wiley and Sons; 1999.
2. Haldar JP, Hernando D, Budde MD, Wang Q, Song S-K, Liang Z-P. High-resolution MR metabolic imaging. In Proc IEEE EMBS 2007:4324–4326.
3. Ryoo S, Rodrigues CI, Stone SS, et al. Program optimization carving for GPU computing. Journal of Parallel and Distributed Computing 2008; doi:10.1016/j.jpdc.2008.05.011.
4. Stone SS, Haldar JP, Tsao SC, Hwu WW, Sutton BP, Liang Z-P. Accelerating advanced MRI reconstruction on GPUs. Journal of Parallel and Distributed Computing 2008; doi:10.1016/j.jpdc.2008.05.013.
1Note that the FHD computation can be approximated with gridding and can run in a few seconds, with perhaps reduced quality of the final reconstructed image.
2Note that declaring x[], y[], z[], rFhD[], and iFhD[] as automatic arrays will not work for our purpose here. Such a declaration would create private copies of all five arrays in the local memory of every thread! All we want is a private copy of one element of each array in the registers of each thread.
3The reason why a constant memory access is not exactly as efficient as a register access is that a memory load instruction is still needed for access to the constant memory.
4Note that not all accesses to read-only data are as favorable for constant memory as the ones we have here. In Chapter 12 we present a case where threads in different blocks access different elements in the same iteration. This more divergent access pattern makes it much harder to fit enough of the data into the constant memory for a kernel launch.
12.1 Application Background
12.2 A Simple Kernel Implementation
12.3 Thread Granularity Adjustment
12.4 Memory Coalescing
12.5 Summary
12.6 Exercises
The previous case study illustrated the process of selecting an appropriate level of a loop nest for parallel execution, the use of constant memory for magnifying the memory bandwidth for read-only data, the use of registers to reduce the consumption of memory bandwidth, and the use of special hardware functional units to accelerate trigonometry functions. In this case study, we use an application based on regular grid data structures to illustrate additional practical techniques that achieve global memory access coalescing and improved computation throughput. We present a series of implementations of an electrostatic potential map calculation kernel, with each version improving upon the previous one through one or more of these techniques. The computation pattern of this application is one of the best matches for massively parallel computing. This case study shows that the effective use of these practical techniques can significantly improve execution throughput and is critical for the application to achieve its potential performance.
This case study is based on VMD (Visual Molecular Dynamics) [HDS1996], a popular software system designed for displaying, animating, and analyzing biomolecular systems. As of 2012, VMD has more than 200,000 registered users. While it has strong built-in support for analyzing biomolecular systems, such as calculating electrostatic potential values at spatial grid points of a molecular system, it has also been a popular tool for displaying other large data sets, such as sequencing data, quantum chemistry simulation data, and volumetric data, due to its versatility and user extensibility.
While VMD is designed to run on a diverse range of hardware—laptops, desktops, clusters, and supercomputers—most users use VMD as a desktop science application for interactive 3D visualization and analysis. For computation that runs too long for interactive use, VMD can also be used in a batch mode to render movies for later use. A motivation for accelerating VMD is to make batch mode jobs fast enough for interactive use. This can drastically improve the productivity of scientific investigations. With CUDA devices widely available in desktop PCs, such acceleration can have broad impact on the VMD user community. To date, multiple aspects of VMD have been accelerated with CUDA, including electrostatic potential calculation, ion placement, molecular orbital calculation and display, and imaging of gas migration pathways in proteins.
The particular calculation used in this case study is the calculation of electrostatic potential maps in a grid space. This calculation is often used in placement of ions into a structure for molecular dynamics simulation. Figure 12.1 shows the placement of ions into a protein structure in preparation for molecular dynamics simulation. In this application, the electrostatic potential map is used to identify spatial locations where ions (round dots around the large molecules) can fit in according to physical laws. The function can also be used to calculate time-averaged potentials during molecular dynamics simulation, which is useful for the simulation process as well as the visualization/analysis of simulation results.
Figure 12.1 Electrostatic potential map is used in building stable structures for molecular dynamics simulation.
There are several methods for calculating electrostatic potential maps. Among them, direct Coulomb summation (DCS) is a highly accurate method that is particularly suitable for GPUs [SPF2007]. The DCS method calculates the electrostatic potential value of each grid point as the sum of contributions from all atoms in the system. This is illustrated in Figure 12.2. The contribution of atom i to a lattice point j is the charge of that atom divided by the distance from lattice point j to atom i. Since this needs to be done for all grid points and all atoms, the number of calculations is proportional to the product of the total number of atoms in the system and the total number of grid points. For a realistic molecular system, this product can be very large. Therefore, the calculation of the electrostatic potential map has traditionally been done as a batch job in VMD.
Figure 12.2 The contribution of atom[i] to the electrostatic potential at lattice point j (potential[j]) is atom[i].charge/r_ij. In the DCS method, the total potential at lattice point j is the sum of contributions from all atoms in the system.
Figure 12.3 shows the base C code of the DCS algorithm. The function is written to process a 2D slice of a 3D grid, and it will be called repeatedly for all the slices of the modeled space. The structure of the function is quite simple, with three levels of for loops. The outer two levels iterate over the y dimension and the x dimension of the grid point space. For each grid point, the innermost for loop iterates over all atoms, calculating the contribution of electrostatic potential energy from all atoms to the grid point. Note that each atom is represented by four consecutive elements of the atoms[] array. The first three elements store the x, y, and z coordinates of the atom, and the fourth element stores the electrical charge of the atom. At the end of the innermost loop, the accumulated value for the grid point is written out to the grid data structure. The outer loops then iterate and take the execution to the next grid point.
Figure 12.3 Base Coulomb potential calculation code for a 2D slice.
Note that the DCS function in Figure 12.3 calculates the x and y coordinates of each grid point on the fly by multiplying the grid point index values by the spacing between grid points. This is a uniform grid method, in which all grid points are spaced at the same distance in all three dimensions. The function does take advantage of the fact that all the grid points in the same slice have the same z coordinate. This value is precalculated by the caller of the function and passed in as a function parameter (z).
Based on what we learned from the MRI case study, two attributes of the DCS method should be apparent. First, the computation is massively parallel: the computation of electrostatic potential for each grid point is independent of that of other grid points. There are two alternative approaches to organizing parallel execution. In the first option, we can use each thread to calculate the contribution of one atom to all grid points. This would be a poor choice since each thread would be writing to all grid points, requiring extensive use of atomic memory operations to coordinate the updates done by different threads to each grid point. The second option uses each thread to calculate the accumulated contributions of all atoms to one grid point. This is a preferred approach since each thread will be writing into its own grid point and there is no need to use atomic operations.
We will form a 2D thread grid that matches the 2D energy grid point organization. To do so, we need to modify the two outer loops into perfectly nested loops so that we can use each thread to execute one iteration of the two-level loop. We can either perform loop fission or move the calculation of the y coordinate into the inner loop. The former would require us to create a new array to hold all y values and would result in two kernels communicating data through global memory. The latter increases the number of times the y coordinate is calculated. In this case, we choose the latter, since the calculation is small and can easily be accommodated in the inner loop without a significant increase in its execution time. The former would have added kernel launch overhead for a kernel in which threads do very little work. The selected transformation allows all i and j iterations to be executed in parallel. This is a trade-off between the amount of calculation done and the level of parallelism achieved.
The second experience that we can apply from the MRI case study is that the electrical charge of every atom will be read by all threads. This is because every atom contributes to every grid point in the DCS method. Furthermore, the values of the atomic electrical charges are not modified during the computation. This means that the atomic charge values can be efficiently stored in the constant memory (in the GPU box in Figure 12.4).
Figure 12.4 Overview of the DCS kernel design.
Figure 12.4 shows an overview of the DCS kernel design. The host program (host box in Figure 12.4) inputs and maintains the atomic charges and their coordinates in the system memory. It also maintains the grid point data structure in the system memory (left side of the host box). The DCS kernel is designed to process a 2D slice of the energy grid point structure (not to be confused with thread grids). The grid on the right side of the host box shows an example of a 2D slice. For each 2D slice, the CPU transfers its grid data to the device global memory. The atom information is divided into chunks to fit into the constant memory. For each chunk of the atom information, the CPU transfers the chunk into the device constant memory, invokes the DCS kernel to calculate the contribution of the current chunk to the current slice, and prepares to transfer the next chunk. After all chunks of the atom information have been processed for the current slice, the slice is transferred back to update the grid point data structure in the CPU system memory, and the system moves on to the next slice.
Within each kernel invocation, the thread blocks are organized to calculate the electrostatic potential of tiles of the grid structure. In the simplest kernel, each thread calculates the value at one grid point. In more sophisticated kernels, each thread calculates multiple grid points and exploits the redundancy between the calculations of the grid points to improve execution speed. This is illustrated in the left side portion labeled as “thread blocks” in Figure 12.4 and is an example of the granularity adjustment optimization discussed in Chapter 6.
Figure 12.5 shows the resulting CUDA kernel code. We omitted some of the declarations. As was the case in the MRI case study, the atominfo[] array is declared in the constant memory by the host code. The host code also needs to divide up the atom information into chunks that fit into the constant memory for each kernel invocation. This means that the kernel will be invoked multiple times when there are multiple chunks of atoms. Since this is similar to the MRI case study, we will not show the details.
Figure 12.5 DCS kernel version 1.
The outer two levels of the loop in Figure 12.3 have been removed from the kernel code and are replaced by the execution configuration parameters in the kernel invocation. Since this is also similar to one of the steps we took in the MRI case study, we will not show the kernel invocation but leave it as an exercise for readers. The rest of the kernel code is straightforward and corresponds directly to the original loop body of the innermost loop.
One particular aspect of the kernel is somewhat subtle and worth mentioning. The kernel code calculates the contribution of a chunk of atoms to a grid point. The grid point value must be preserved in the global memory and updated by each kernel invocation. This means that the kernel needs to read the current grid point value, add the contribution of the current chunk of atoms, and write the updated value to global memory. The code attempts to hide the global memory latency by loading the grid value at the beginning of the kernel and using it at the end of the kernel. This helps reduce the number of warps needed by the streaming multiprocessor (SM) scheduler to hide the global memory latency.
The performance of the kernel in Figure 12.5 is quite good, measured at 186 GFLOPS on a G80, a first-generation CUDA device. In terms of application-level performance, the implementation can process 18.6 billion atom evaluations per second. A quick glance over the code shows that each thread performs nine floating-point operations for every four memory elements accessed. On the surface, this is not a very good ratio; we need a ratio of at least 8 to avoid global memory congestion. However, all four memory accesses are made to the atominfo[] array. These atominfo[] array elements for each atom are cached in a hardware cache memory in each SM and are broadcast to a large number of threads. A calculation similar to that in the MRI case study shows that the massive reuse of memory elements across threads makes the constant cache extremely effective, boosting the effective ratio of floating-point operations per global memory access to well above 10:1. As a result, global memory bandwidth is not a limiting factor for this kernel.
Although the kernel in Figure 12.5 avoids a global memory bottleneck through constant caching, it still executes four constant memory access instructions for every nine floating-point operations performed. These memory access instructions consume hardware resources that could otherwise be used to increase the execution throughput of floating-point instructions. This section shows that we can fuse several threads together so that the atominfo[] data can be fetched once from the constant memory, stored into registers, and used for multiple grid points. This idea is illustrated in Figure 12.6.
Figure 12.6 Reusing information among multiple grid points.
Furthermore, all grid points along the same row have the same y coordinate. Therefore, the difference between the y coordinate of an atom and the y coordinate of any grid point along a row has the same value. In the DCS kernel version 1 in Figure 12.5, this calculation is redundantly done by all threads for all grid points in a row when calculating the distance between the atom and the grid points. We can eliminate this redundancy and improve the execution efficiency.
The idea is to have each thread calculate the electrostatic potential for multiple grid points. The kernel in Figure 12.7 has each thread calculate four grid points. For each atom, the code calculates dy, the difference between the y coordinates, in line 2. It then calculates the expression dy*dy plus the precalculated dz*dz information and saves it to the auto variable dysqpdzsq, which is assigned to a register by default. This value is the same for all four grid points. Therefore, the calculation of energyvalx1 through energyvalx4 can all just use the value stored in the register. Furthermore, the electrical charge information is also read from constant memory and stored in the automatic variable charge. Similarly, the x coordinate of the atom is also read from constant memory into the auto variable x. Altogether, when processing an atom for four grid points, this kernel eliminates three accesses to constant memory for atominfo[atomid].y, three accesses for atominfo[atomid].x, three accesses for atominfo[atomid].w, three floating-point subtraction operations, five floating-point multiply operations, and nine floating-point add operations. A quick inspection of the kernel code in Figure 12.7 shows that each iteration of the loop performs four constant memory accesses, five floating-point subtractions, nine floating-point additions, and five floating-point multiplications for four grid points.
Figure 12.7 DCS kernel version 2.
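A plausible reconstruction of this coarsened kernel is shown below (a sketch consistent with the description, not the book's exact listing; cenergy_v2 and the index setup are assumptions, and atominfo is the constant array sketched earlier).

    __global__ void cenergy_v2(float *energygrid, int gridx, float gridspacing,
                               int numatoms) {
        /* each thread now covers four adjacent grid points in x */
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int outaddr = gridx * j + i;
        float coory  = gridspacing * j;
        float coorx1 = gridspacing * i;
        float coorx2 = coorx1 + gridspacing;
        float coorx3 = coorx2 + gridspacing;
        float coorx4 = coorx3 + gridspacing;
        float energyvalx1 = 0.f, energyvalx2 = 0.f;
        float energyvalx3 = 0.f, energyvalx4 = 0.f;
        for (int atomid = 0; atomid < numatoms; atomid++) {
            float dy = coory - atominfo[atomid].y;           /* computed once ...  */
            float dysqpdzsq = dy * dy + atominfo[atomid].z;  /* ... for 4 points   */
            float x = atominfo[atomid].x;                    /* one fetch each for */
            float charge = atominfo[atomid].w;               /* x and charge       */
            float dx1 = coorx1 - x;
            float dx2 = coorx2 - x;
            float dx3 = coorx3 - x;
            float dx4 = coorx4 - x;
            energyvalx1 += charge * rsqrtf(dx1*dx1 + dysqpdzsq);
            energyvalx2 += charge * rsqrtf(dx2*dx2 + dysqpdzsq);
            energyvalx3 += charge * rsqrtf(dx3*dx3 + dysqpdzsq);
            energyvalx4 += charge * rsqrtf(dx4*dx4 + dysqpdzsq);
        }
        energygrid[outaddr]     += energyvalx1;  /* adjacent points from one thread: */
        energygrid[outaddr + 1] += energyvalx2;  /* writes by a half-warp land four  */
        energygrid[outaddr + 2] += energyvalx3;  /* elements apart (fixed in         */
        energygrid[outaddr + 3] += energyvalx4;  /* Section 12.4)                    */
    }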
Readers should also verify that the version of the DCS kernel in Figure 12.5 performs 16 constant memory accesses, 8 floating-point subtractions, 12 floating-point additions, and 12 floating-point multiplications, for a total of 48 operations for the same four grid points. Going from Figure 12.5 to Figure 12.7, there is a total reduction from 48 operations down to 25 operations, a sizable reduction. This translates into an increased execution speed from 186 GFLOPS to 259 GFLOPS on a G80. In terms of application-level throughput, the performance increases from 18.6 billion atom evaluations per second to 33.4 billion atom evaluations per second. The reason the application-level performance improvement is higher than the FLOPS improvement is that some of the floating-point operations have been eliminated.
The cost of the optimization is that more registers are used by each thread. This reduces the number of threads that can be assigned to each SM. However, as the results show, this is a good trade-off that yields an excellent performance improvement.
While the performance of the DCS kernel version 2 in Figure 12.7 is quite high, a quick profiling run reveals that the threads perform memory writes inefficiently. As shown in Figures 12.6 and 12.7, each thread calculates four neighboring grid points. This seems to be a reasonable choice. However, as we illustrate in Figure 12.8, the access pattern of threads will result in uncoalesced global memory writes.
Figure 12.8 Organizing threads and memory layout for coalesced writes.
There are two problems that cause the uncoalesced writes in DCS kernel version 2. First, each thread calculates four adjacent grid points. Thus, for each statement that accesses the energygrid[] array, the threads in a warp are not accessing adjacent locations; two adjacent threads access memory locations that are four elements apart, with three elements in between. As a result, the 16 locations to be written by all the threads in a half-warp are spread out, with three elements in between the written locations. This problem can be solved by assigning adjacent grid points to adjacent threads in each half-warp. Assuming that we still want each thread to calculate four grid points, we first assign 16 consecutive grid points to the 16 threads in a half-warp. We then assign the next 16 consecutive grid points to the same 16 threads. We repeat the assignment until each thread has the desired number of grid points. This assignment is illustrated in Figure 12.8. With some experimentation, the best number of grid points per thread turns out to be 8 for the G80.
The kernel code with a warp-aware assignment of grid points to threads is shown in Figure 12.9. Note that the x coordinates used to calculate the distances are offset by the variable gridspacing_coalescing, which is the original grid spacing times the constant BLOCKSIZE (16). This reflects the fact that the x coordinates of the 8 grid points handled by each thread are 16 grid points apart from each other. Also, after the end of the loop, memory writes to the energygrid[] array are indexed by outaddr, outaddr+BLOCKSIZE, …, outaddr+7*BLOCKSIZE. Each of these indices is one BLOCKSIZE (16) away from the previous one. The detailed thread block organization for this kernel is left as an exercise. Readers should keep in mind that by setting the x dimension size of the thread block equal to the half-warp size (16), we can simplify the indexing in the kernel.
Figure 12.9 DCS kernel version 3.
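The following is a sketch consistent with the description of Figure 12.9; gridspacing_coalescing and BLOCKSIZE follow the text, while the other names and the index setup are assumptions. Each thread handles 8 grid points spaced BLOCKSIZE apart, so each block covers 8*BLOCKSIZE consecutive points in x.

    #define BLOCKSIZE 16    /* x dimension of the thread block = half-warp size */

    __global__ void cenergy_v3(float *energygrid, int gridx, float gridspacing,
                               int numatoms) {
        int xindex = blockIdx.x * blockDim.x * 8 + threadIdx.x;
        int yindex = blockIdx.y * blockDim.y + threadIdx.y;
        int outaddr = gridx * yindex + xindex;
        float coorx = gridspacing * xindex;
        float coory = gridspacing * yindex;
        float gridspacing_coalescing = gridspacing * BLOCKSIZE;
        float ev1 = 0.f, ev2 = 0.f, ev3 = 0.f, ev4 = 0.f;
        float ev5 = 0.f, ev6 = 0.f, ev7 = 0.f, ev8 = 0.f;
        for (int atomid = 0; atomid < numatoms; atomid++) {
            float dy = coory - atominfo[atomid].y;
            float dyz2 = dy * dy + atominfo[atomid].z;
            float charge = atominfo[atomid].w;
            float dx1 = coorx - atominfo[atomid].x;   /* successive x offsets are */
            float dx2 = dx1 + gridspacing_coalescing; /* BLOCKSIZE points apart   */
            float dx3 = dx2 + gridspacing_coalescing;
            float dx4 = dx3 + gridspacing_coalescing;
            float dx5 = dx4 + gridspacing_coalescing;
            float dx6 = dx5 + gridspacing_coalescing;
            float dx7 = dx6 + gridspacing_coalescing;
            float dx8 = dx7 + gridspacing_coalescing;
            ev1 += charge * rsqrtf(dx1*dx1 + dyz2);
            ev2 += charge * rsqrtf(dx2*dx2 + dyz2);
            ev3 += charge * rsqrtf(dx3*dx3 + dyz2);
            ev4 += charge * rsqrtf(dx4*dx4 + dyz2);
            ev5 += charge * rsqrtf(dx5*dx5 + dyz2);
            ev6 += charge * rsqrtf(dx6*dx6 + dyz2);
            ev7 += charge * rsqrtf(dx7*dx7 + dyz2);
            ev8 += charge * rsqrtf(dx8*dx8 + dyz2);
        }
        /* adjacent threads write adjacent addresses: coalesced; the grid reads
           are also deferred to the end, as discussed below */
        energygrid[outaddr                ] += ev1;
        energygrid[outaddr + 1 * BLOCKSIZE] += ev2;
        energygrid[outaddr + 2 * BLOCKSIZE] += ev3;
        energygrid[outaddr + 3 * BLOCKSIZE] += ev4;
        energygrid[outaddr + 4 * BLOCKSIZE] += ev5;
        energygrid[outaddr + 5 * BLOCKSIZE] += ev6;
        energygrid[outaddr + 6 * BLOCKSIZE] += ev7;
        energygrid[outaddr + 7 * BLOCKSIZE] += ev8;
    }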
The other cause of uncoalesced memory writes is the layout of the energygrid[] array, which is a 3D array. If the x dimension of the array is not a multiple of the half-warp size, the beginning location of the second row, as well as those of subsequent rows, will no longer be at 16-word boundaries. In older devices, this means that the half-warp accesses will not be coalesced, even though they write to consecutive locations. This problem can be corrected by padding each row with additional elements so that the total length of the x dimension is a multiple of 16. This may require adding up to 15 elements, or 60 bytes, to each row, as shown in Figure 12.8. With the kernel of Figure 12.9, the number of elements in the x dimension needs to be a multiple of 8×16=128, because each thread actually writes 8 elements in each iteration. Thus, one may need to pad up to 127 elements, or 508 bytes, to each row.
Furthermore, there is a potential problem with the last row of thread blocks. Since the grid array may not have enough rows, some threads may end up writing outside the grid data structure. Because the grid data structure is a 3D array, these threads would write into the next slice of grid points. As we discussed in Chapter 3, we can add a test in the kernel to avoid writing array elements that are outside the known y dimension size. However, this would add a number of overhead instructions and incur control divergence. An alternative solution is to pad the y dimension of the grid structure so that it contains a whole number of the tiles covered by thread blocks. This is shown in Figure 12.8 as the bottom padding in the grid structure. In general, one may need to add up to 15 rows due to this padding.
The cost of padding can be substantial for smaller grid structures. For example, if the energy grid has 100×100 grid points in each 2D slice, it would be padded into a 128×112 slice. The total number of grid points increases from 10,000 to 14,336, a 43% overhead. If we had to pad the entire 3D structure, the number of grid points would increase from 100×100×100 (1,000,000) to 128×112×112 (1,605,632), a 60% overhead! This is part of the reason why we calculate the energy grid in 2D slices and use the host code to iterate over these 2D slices. Writing a single kernel to process the entire 3D structure would have incurred much more extra overhead. This type of trade-off appears frequently in simulation models, differential equation solvers, and video processing applications.
The DCS version 3 kernel shown in Figure 12.9 achieves about 291 GFLOPS, or 39.5 billion atom evaluations per second, on a G80. On a later Fermi device, it achieves 535.16 GFLOPS, or 72.56 billion atom evaluations per second. On a recent GeForce GTX680 (Kepler), it achieves a whopping 1,267.26 GFLOPS, or 171.83 billion atom evaluations per second! The measured speed of this kernel also includes a slight boost from moving the read access to the energygrid[] array from the beginning of the kernel to the end of the kernel. The contributions to the grid points are first calculated in the loop. The code then loads the original grid point data after the loop, adds the contributions to the data, and writes the updated values back. Although this movement exposes more of the global memory latency to each thread, it saves the consumption of eight registers. Since the kernel already uses many registers to hold the atom data and the distances, this saving in the number of registers used relieves a critical bottleneck for the kernel. It allows more thread blocks to be assigned to each SM and achieves an overall performance improvement.
Figure 12.10 shows a summary of the performance comparison between the various DCS kernel implementations and how they compare with an optimized single-core CPU execution. One important observation is that the relative merit of the kernels varies with grid dimension lengths. However, the DCS version 3 (CUDA-Unroll8clx) performs consistently better than all others once the grid dimension length is larger than 300.
Figure 12.10 Performance comparison of various DCS kernel versions.
A detailed comparison between the CPU performance and the CPU–GPU joint performance shows a commonly observed trade-off. Figure 12.11 shows a plot of the execution time for a medium-size grid system as the number of atoms to be evaluated varies. For 400 atoms or fewer, the CPU performs better. This is because the GPU has a fixed initialization overhead of 110 ms regardless of the number of atoms to be evaluated. Also, for a small number of atoms the GPU is underutilized; thus, the curve of the GPU execution time is quite flat between 100 and 1,000 atoms.
Figure 12.11 Single-thread CPU versus CPU–GPU comparison.
The plot in Figure 12.11 reinforces a commonly held principle that GPUs perform better for large amounts of data. Once the number of atoms reaches 10,000, the GPU is fully utilized. The slopes of the CPU and the CPU–GPU execution times become virtually identical, with the CPU–GPU execution being consistently 44× faster than the CPU execution for all input sizes.
12.1. Complete the implementation of the DCS kernel as outlined in Figure 12.5. Fill in all the missing declarations. Give the kernel launch statement with all the execution configuration parameters.
12.2. Compare the number of operations (memory loads, floating-point arithmetic, branches) executed in each iteration of the kernel in Figure 12.7 with those in each iteration of the kernel in Figure 12.5. Keep in mind that each iteration of the former corresponds to four iterations of the latter.
12.3. Complete the implementation of the DCS kernel version 3 in Figure 12.9. Explain in your own words how the thread accesses are coalesced in this implementation.
12.4. For the memory padding in Figure 12.8 and DCS kernel version 3 in Figure 12.9, show why one needs to pad up to 127 elements in the x dimension but only up to 15 elements in the y dimension.
12.5. Give two reasons for adding extra “padding” elements to arrays allocated in the GPU global memory, as shown in Figure 12.8.
12.6. Give two potential disadvantages associated with increasing the amount of work done in each CUDA thread, as shown in Section 12.3.
1. Humphrey W, Dalke A, Schulten K. VMD—Visual Molecular Dynamics. Journal of Molecular Graphics. 1996;14:33–38.
2. Stone JE, Phillips JC, Freddolino PL, Hardy DJ, Trabuco LG, Schulten K. Accelerating molecular modeling applications with graphics processors. Journal of Computational Chemistry. 2007;28:2618–2640.
13.1 Goals of Parallel Computing
13.2 Problem Decomposition
13.3 Algorithm Selection
13.4 Computational Thinking
13.5 Summary
13.6 Exercises
We have so far concentrated on the practical experience of parallel programming, which consists of CUDA programming model features, performance and numerical considerations, parallel patterns, and application case studies. We will now switch gears to more abstract concepts. We will first generalize parallel programming into a computational thinking process of decomposing a domain problem into well-defined, coordinated work units that can each be realized with efficient numerical methods and well-known algorithms. A programmer with strong computational thinking skills not only analyzes but also transforms the structure of a domain problem: which parts are inherently serial, which parts are amenable to high-performance parallel execution, and the trade-offs involved in moving parts from the former category to the latter. With good problem decomposition, the programmer can select and implement algorithms that achieve an appropriate compromise between parallelism, computational efficiency, and memory bandwidth consumption. A strong combination of domain knowledge and computational thinking skills is often needed to create successful computational solutions to challenging domain problems. This chapter will give readers more insight into parallel programming and computational thinking in general.
Before we discuss the fundamental concepts of parallel programming, it is important for us to first review the three main reasons why people adopt parallel computing. The first goal is to solve a given problem in less time. For example, an investment firm may need to run a financial portfolio scenario risk analysis package on all its portfolios during after-trading hours. Such an analysis may require 200 hours on a sequential computer. However, the portfolio management process may require that analysis be completed in four hours to be in time for major decisions based on that information. Using parallel computing may speed up the analysis and allow it to complete within the required time window.
The second goal of using parallel computing is to solve bigger problems within a given amount of time. In our financial portfolio analysis example, the investment firm may be able to run the portfolio scenario risk analysis on its current portfolio within a given time window using sequential computing. However, the firm is planning on expanding the number of holdings in its portfolio. The enlarged problem size would cause the running time of analysis under sequential computation to exceed the time window. Parallel computing that reduces the running time of the bigger problem size can help accommodate the planned expansion to the portfolio.
The third goal of using parallel computing is to achieve better solutions for a given problem within a given amount of time. The investment firm may have been using an approximate model in its portfolio scenario risk analysis. Using a more accurate model may increase the computational complexity and push the running time on a sequential computer beyond the allowed window. For example, a more accurate model may require consideration of interactions between more types of risk factors, using a more numerically complex formula. Parallel computing that reduces the running time of the more accurate model can allow the analysis to complete within the allowed time window.
In practice, parallel computing may be driven by a combination of the aforementioned three goals. It should be clear from our discussion that parallel computing is primarily motivated by increased speed. The first goal is achieved by increased speed in running the existing model on the current problem size. The second goal is achieved by increased speed in running the existing model on a larger problem size. The third goal is achieved by increased speed in running a more complex model on the current problem size. Obviously, the increased speed through parallel computing can be used to achieve a combination of these goals. For example, parallel computing can reduce the runtime of a more complex model on a larger problem size.
It should also be clear from our discussion that applications that are good candidates for parallel computing typically involve large problem sizes and high modeling complexity. That is, these applications process a large amount of data, perform many iterations on the data, or both. For such a problem to be solved with parallel computing, the problem must be formulated in such a way that it can be decomposed into subproblems that can be safely solved at the same time. Under such formulation and decomposition, the programmer writes code and organizes data to solve these subproblems concurrently.
In Chapters 11 and 12 we presented two problems that are good candidates for parallel computing. The MRI reconstruction problem involves a large amount of k-space sample data. Each k-space sample is also used many times in calculating its contributions to the reconstructed voxel data. For a reasonably high-resolution reconstruction, each sample is used a very large number of times. We showed that a good decomposition of the FHD problem in MRI reconstruction is to form subproblems that each calculate the value of one FHD element. All these subproblems can be solved in parallel with each other, and we use a massive number of CUDA threads to solve them.
Figure 12.11 further shows that the electrostatic potential calculation problem should be solved with a massively parallel CUDA device only if there are 400 or more atoms. A realistic molecular dynamic system model typically involves at least hundreds of thousands of atoms and millions of energy grid points. The electrostatic charge information of each atom is used many times in calculating its contributions to the energy grid points. We showed that a good decomposition of the electrostatic potential calculation problem is to form subproblems that each calculate the energy value of a grid point. All the subproblems can be solved in parallel with each other. We use a massive number of CUDA threads to solve these subproblems.
The process of parallel programming can typically be divided into four steps: problem decomposition, algorithm selection, implementation in a language, and performance tuning. The last two steps were the focus of previous chapters. In the next two sections, we will discuss the first two steps with more generality as well as depth.
Finding parallelism in large computational problems is often conceptually simple but can be challenging in practice. The key is to identify the work to be performed by each unit of parallel execution (a thread in CUDA) so that the inherent parallelism of the problem is well utilized. For example, in the electrostatic potential map calculation problem, it is clear that all atoms can be processed in parallel and all energy grid points can be calculated in parallel. However, one must take care when decomposing the calculation work into units of parallel execution; this decomposition will be referred to as the threading arrangement. As we discussed in Section 12.2, the decomposition of the electrostatic potential map calculation problem can be atom-centric or grid-centric. In an atom-centric threading arrangement, each thread is responsible for calculating the effect of one atom on all grid points. In contrast, a grid-centric threading arrangement uses each thread to calculate the effect of all atoms on one grid point.
While both threading arrangements lead to similar levels of parallel execution and the same execution results, they can exhibit very different performance on a given hardware system. The grid-centric arrangement has a memory access behavior called gather, where each thread gathers, or collects, the effects of input atoms into a grid point. Figure 13.1(a) illustrates the gather access behavior. Gather is a desirable thread arrangement in CUDA devices because the threads can accumulate their results in their private registers. Moreover, multiple threads share input atom values and can effectively use constant memory caching or shared memory to conserve global memory bandwidth.
Figure 13.1 (a) Gather and (b) scatter based thread arrangements.
The atom-centric arrangement, on the other hand, exhibits a memory access behavior called scatter, where each thread scatters, or distributes, the effect of an atom into grid points. The scatter behavior is illustrated in Figure 13.1(b). This is an undesirable arrangement in CUDA devices because multiple threads can write into the same grid point at the same time. The grid points must be stored in a memory that can be written by all the threads involved, and atomic operations must be used to prevent race conditions and loss of values during simultaneous writes to a grid point by multiple threads. These atomic operations are much slower than the register accesses available in the grid-centric, gather-based arrangement. Understanding the behavior of the threading arrangement and the limitations of the hardware allows a parallel programmer to steer toward the more desirable gather-based arrangement.
A real application often consists of multiple modules that work together. The electrostatic potential map calculation is one such module in molecular dynamics applications. Figure 13.2 shows an overview of the major modules of a molecular dynamics application. For each atom in the system, the application needs to calculate the various forms of forces (e.g., vibrational, rotational, and nonbonded) that are exerted on the atom. Each form of force is calculated by a different method. At the high level, a programmer needs to decide how the work is organized. Note that the amount of work can vary dramatically between these modules. The nonbonded force calculation typically involves interactions among many atoms and requires much more computation than the vibrational and rotational forces. Therefore, these modules tend to be realized as separate passes over the force data structure. The programmer needs to decide whether each pass is worth implementing on a CUDA device. For example, he or she may decide that the vibrational and rotational force calculations do not involve a sufficient amount of work to warrant execution on a device. Such a decision would lead to a CUDA program that launches a kernel to calculate the nonbonded forces for all the grid points while continuing to calculate the vibrational and rotational forces for the grid points on the host. The module that updates atomic positions and velocities may also run on the host. It first combines the vibrational and rotational forces from the host with the nonbonded forces from the device. It then uses the combined forces to calculate the new atomic positions and velocities.
Figure 13.2 Major tasks of a molecular dynamics application.
The portion of the work done by the device will ultimately decide the application-level speedup achieved by parallelization. For example, assume that the nonbonded force calculation accounts for 95% of the original sequential execution time and that it is accelerated by 100× using a CUDA device. Further assume that the rest of the application remains on the host and receives no speedup. The application-level speedup is 1/(5%+95%/100)=1/(5%+0.95%)=1/(5.95%)=17×. This is a demonstration of Amdahl's law: the application speedup due to parallel computing is limited by the sequential portion of the application. In this case, even though the sequential portion of the application is quite small (5%), it limits the application-level speedup to 17×, even though the nonbonded force calculation has a speedup of 100×. This example illustrates a major challenge in decomposing large applications: the accumulated execution time of small activities that are not worth parallel execution on a CUDA device can become a limiting factor in the speedup seen by end users.
Amdahl's law often motivates task-level parallelization. Although some of these smaller activities do not warrant fine-grained, massively parallel execution, it may be desirable to execute some of them in parallel with each other when the data set is large enough. This could be achieved by using a multicore host to execute such tasks in parallel. Alternatively, we could try to execute multiple small kernels simultaneously, each corresponding to one task. Earlier CUDA devices did not support such parallelism, but the new generation of devices, such as Kepler, do.
An alternative approach to reducing the effect of sequential tasks is to exploit data parallelism in a hierarchical manner. For example, in a Message Passing Interface (MPI) [MPI2009] implementation, a molecular dynamics application would typically distribute large chunks of the spatial grid and its associated atoms to the nodes of a networked computing cluster. By using the host of each node to calculate the vibrational and rotational forces for its chunk of atoms, we can take advantage of multiple host CPUs to achieve speedup for these lesser modules. Each node can use a CUDA device to calculate the nonbonded forces at a higher level of speedup. The nodes will need to exchange data to accommodate forces that go across chunks and atoms that move across chunk boundaries. We will discuss more details of joint MPI-CUDA programming in Chapter 19. The main point here is that MPI and CUDA can be used in a complementary way in applications to jointly achieve a higher level of speed with large data sets.
An algorithm is a step-by-step procedure where each step is precisely stated and can be carried out by a computer. An algorithm must exhibit three essential properties: definiteness, effective computability, and finiteness. Definiteness refers to the notion that each step is precisely stated; there is no room for ambiguity as to what is to be done. Effective computability refers to the fact that each step can be carried out by a computer. Finiteness means that the algorithm must be guaranteed to terminate.
Given a problem, we can typically come up with multiple algorithms to solve it. Some require fewer steps of computation than others; some allow higher degrees of parallel execution than others; some have better numerical stability than others; and some consume less memory bandwidth than others. Unfortunately, there is often no single algorithm that is better than the others in all four aspects. Given a problem and a decomposition strategy, a parallel programmer often needs to select the algorithm that achieves the best compromise for a given hardware system.
In our matrix–matrix multiplication example, we decided to decompose the problem by having each thread compute the dot product for an output element. Given this decomposition, we presented two different algorithms. The algorithm in Section 4.3 is a straightforward algorithm where every thread simply performs an entire dot product. Although the algorithm fully utilizes the parallelism available in the decomposition, it consumes too much global memory bandwidth. In Section 5.4, we introduced tiling, an important algorithm strategy for conserving memory bandwidth. Note that the tiled algorithm partitions the dot products into phases. All threads involved in a tile must synchronize with each other so that they can collaboratively load the tile of input data into the shared memory and collectively utilize the loaded data before they move on to the next phase. As we showed in Figure 5.12, the tiled algorithm requires each thread to execute more statements and incur more overhead in indexing the input arrays than the original algorithm. However, it runs much faster because it consumes much less global memory bandwidth. In general, tiling is one of the most important algorithm strategies for matrix applications to achieve high performance.
As we demonstrated in Sections 6.4 and 12.3, we can systematically merge threads to achieve a higher level of instruction and memory access efficiency. In Section 6.4, threads that handle the same columns of neighboring tiles are combined into a new thread. This allows the new thread to access each M element only once while calculating multiple dot products, reducing the number of address calculation and memory load instructions executed. It also further reduces the consumption of global memory bandwidth. The same technique, when applied to the DCS kernel in the electrostatic potential calculation, further reduces the number of distance calculations while achieving a similar reduction in address calculation and memory load instructions.
One can often come up with even more aggressive algorithm strategies. An important algorithm strategy, referred to as cutoff binning, can significantly improve the execution efficiency of grid algorithms by sacrificing a small amount of accuracy. This is based on the observation that many grid calculation problems are based on physical laws where the numerical contributions from particles or samples that are far away from a grid point can be treated collectively with an implicit method at much lower computational complexity. This is illustrated for the electrostatic potential calculation in Figure 13.3. Figure 13.3(a) shows the direct summation algorithm discussed in Chapter 12, in which each grid point receives contributions from all atoms. While this is a very parallel approach that achieves excellent speedup over CPU-only execution for moderate-size energy grid systems, as we showed in Figure 12.11, it does not scale well to very large energy grid systems, where the number of atoms increases in proportion to the volume of the system. The amount of computation increases with the square of the volume. For large-volume systems, such an increase makes the computation time excessively long, even for massively parallel devices.
Figure 13.3 Cutoff summation algorithm.
In practice, we know that each grid point needs to receive contributions only from atoms that are close to it. The atoms that are far away from a grid point have negligible contributions to the energy value at the grid point, because the contribution is inversely proportional to the distance. Figure 13.3(b) illustrates this observation with a circle drawn around a grid point. The contributions to the grid point energy from atoms outside the circle (maroon) are negligible. If we can devise an algorithm in which each grid point receives contributions only from atoms within a fixed radius of its coordinate (green), the computational complexity of the algorithm, and hence its computation time, would be reduced to being linearly proportional to the volume of the system. Such algorithms have been used extensively in sequential computing.
In sequential computing, a simple cutoff algorithm handles one atom at a time. For each atom, the algorithm iterates through the grid points that fall within a radius of the atom's coordinate. This is a straightforward procedure, since the grid points are in an array that can be easily indexed as a function of their coordinates. However, this simple procedure does not carry over easily to parallel execution. The reason is what we discussed in Section 13.2: the atom-centric decomposition does not work well due to its scatter memory access behavior. However, as we discussed in Chapter 9, it is important that a parallel algorithm match the work efficiency of an efficient sequential algorithm.
Therefore, we need to find a cutoff binning algorithm based on the grid-centric decomposition: each thread calculates the energy value at one grid point. Fortunately, there is a well-known approach to adapting a direct summation algorithm, such as the one in Figure 12.9, into a cutoff binning algorithm. Rodrigues et al. present such an algorithm for the electrostatic potential problem [RSH2008].
The key idea of the algorithm is to first sort the input atoms into bins according to their coordinates. Each bin corresponds to a box in the grid space and contains all atoms whose coordinates fall into the box. We define a "neighborhood" of bins for a grid point to be the collection of bins that contain all the atoms that can contribute to its energy value. If we have an efficient way of managing the neighborhood bins for all grid points, we can calculate the energy value for a grid point by examining its neighborhood bins. This is illustrated in Figure 13.3(c). Although Figure 13.3(c) shows only one layer (2D) of bins immediately surrounding the bin that contains a grid point as its neighborhood, a real algorithm will typically have multiple layers (3D) of bins in a grid point's neighborhood. In this algorithm, all threads iterate through their own neighborhood. They use their block and thread indices to identify the appropriate bins. Note that some of the atoms in the surrounding bins may not fall within the radius. Therefore, when processing an atom, all threads need to check whether the atom falls within their radius. This can cause some control divergence among threads in a warp.
The main source of improvement in work efficiency comes from the fact that each thread now examines a much smaller set of atoms in a large grid system. This, however, makes constant memory much less attractive for holding the atoms. Since thread blocks will be accessing different neighborhoods, the limited-size constant memory is unlikely to be able to hold all the atoms needed by all active thread blocks. This motivates the use of global memory to hold a much larger set of atoms. To reduce bandwidth consumption, threads in a block collaborate in loading the atom information of their common neighborhood into shared memory. All threads then examine the atoms from shared memory. Readers are referred to Rodrigues et al. [RSH 2008] for more details of this algorithm.
One subtle issue with binning is that bins may end up with different numbers of atoms. Since the atoms are statistically distributed in the grid system, some bins may have many atoms while some bins may end up with no atoms at all. To guarantee memory coalescing, it is important that all bins are of the same size and aligned at appropriate coalescing boundaries. To accommodate the bins with the largest number of atoms, we would need to make all other bins the same size. This would require us to fill many bins with dummy atoms whose electrical charge is 0, which causes two negative effects. First, the dummy atoms still occupy global memory and shared memory storage. They also consume data transfer bandwidth to the device. Second, the dummy atoms extend the execution time of the thread blocks whose bins have few real atoms.
A well-known solution is to set the bin size at a reasonable level, typically much smaller than the largest possible number of atoms in a bin. The binning process maintains an overflow list. When processing an atom, if the atom's home bin is full, the atom is added to the overflow list instead. After the device completes a kernel, the resulting grid point energy values are transferred back to the host. The host executes a sequential cutoff algorithm on the atoms in the overflow list to complete the missing contributions from these overflow atoms. As long as the overflow atoms account for only a small percentage of all the atoms, the additional sequential processing time for them is typically shorter than the device execution time. One can also design the kernel so that each kernel invocation calculates the energy values for a subvolume of grid points. After each kernel completes, the host launches the next kernel and processes the overflow atoms of the completed kernel. Thus, the host processes overflow atoms while the device executes the next kernel. This approach can hide most, if not all, of the delay in processing overflow atoms, since it is done in parallel with the execution of the next kernel.
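One possible host-side binning routine with an overflow list, matching the scheme just described, is sketched below in C; the flat bin layout and all names are illustrative, and the routine also outlines one way to approach Exercise 13.1.

#include <string.h>

typedef struct { float x, y, z, charge; } Atom;
#define BIN_CAP 8

void bin_atoms(const Atom *atoms, int natoms, float binlen,
               int nbx, int nby, int nbz,
               Atom *bins,        /* nbx*nby*nbz*BIN_CAP elements, pre-zeroed */
               int *bincount,     /* nbx*nby*nbz elements */
               Atom *overflow, int *noverflow) {
    memset(bincount, 0, (size_t)nbx * nby * nbz * sizeof(int));
    *noverflow = 0;
    for (int n = 0; n < natoms; n++) {
        int bi = (int)(atoms[n].x / binlen);   /* home bin of this atom */
        int bj = (int)(atoms[n].y / binlen);
        int bk = (int)(atoms[n].z / binlen);
        int b = (bk * nby + bj) * nbx + bi;
        if (bincount[b] < BIN_CAP)
            bins[b * BIN_CAP + bincount[b]++] = atoms[n];
        else
            overflow[(*noverflow)++] = atoms[n];   /* home bin full */
    }
    /* Unused slots keep charge 0 and act as the dummy atoms mentioned above,
       so every bin has the same, coalescing-friendly size. */
}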
Figure 13.4 shows a comparison of the scalability and performance of the various electrostatic potential map algorithms. Note that the CPU-SSE3 curve is based on a sequential cutoff algorithm. For maps with small volumes, around 1,000 Angstrom³, the host (CPU with SSE) executes faster than the DCS kernel shown in Figure 13.4. This is because there is not enough work to fully utilize a CUDA device for such a small volume. However, for moderate volumes, between 2,000 and 500,000 Angstrom³, the direct summation kernel performs significantly better than the host due to its massively parallel execution. As we anticipated, the direct summation kernel scales poorly when the volume size reaches about 1,000,000 Angstrom³, and runs longer than the sequential algorithm on the CPU! This is due to the fact that the algorithmic complexity of the DCS kernel is higher than that of the sequential algorithm, so the amount of work done by the kernel grows much faster than that done by the sequential algorithm. For volume sizes larger than 1,000,000 Angstrom³, the amount of work is so large that it swamps the hardware execution resources.
Figure 13.4 Scalability and performance of different algorithms for calculating an electrostatic potential map.
Figure 13.4 also shows the running time of three binned cutoff algorithms. The LargeBin algorithm is a straightforward adaptation of the DCS kernel for the cutoff. The kernel is designed to process a subvolume of the grid points. Before each kernel launch, the CPU transfers all atoms that are in the combined neighborhood of all the grid points in the subvolume. These atoms are still stored in the constant memory. All threads examine all atoms in the joint neighborhood. The advantage of the kernel is its simplicity. It is essentially the same as the direct summation kernel with a relatively large, preselected neighborhood of atoms. Note that the LargeBin approach performs reasonably well for moderate volumes and scales well for large volumes.
The SmallBin algorithm allows the threads running the same kernel to process different neighborhoods of atoms. This is the algorithm that uses global memory and shared memory for storing atoms. It achieves higher efficiency than the LargeBin algorithm because each thread needs to examine a smaller number of atoms. For moderate volumes, around 8,000 Angstrom³, the LargeBin algorithm slightly outperforms the SmallBin algorithm. The reason is that the SmallBin algorithm incurs more instruction overhead for loading atoms from global memory into shared memory. For a moderate volume, there is a limited number of atoms in the entire system, and the ability to examine a smaller number of atoms does not provide sufficient advantage to overcome the additional instruction overhead. However, the difference is so small at 8,000 Angstrom³ that the SmallBin algorithm is still a clear win across all volume sizes. The SmallBin-Overlap algorithm overlaps the sequential overflow-atom processing with the next kernel execution. It provides a slight but noticeable improvement in running time over the SmallBin algorithm. The SmallBin-Overlap algorithm achieves a 17× speedup over an efficiently implemented sequential CPU-SSE cutoff algorithm, and maintains the same scalability for large volumes.
Computational thinking is arguably the most important aspect of parallel application development [Wing2006]. We define computational thinking as the thought process of formulating domain problems in terms of computation steps and algorithms. Like any other thought process or problem-solving skill, computational thinking is an art. As we mentioned in Chapter 1, we believe that computational thinking is best taught with an iterative approach where students bounce back and forth between practical experience and abstract concepts.
The electrostatic potential map kernels used in Chapter 12 and this chapter serve as good examples of computational thinking. To develop an efficient parallel application that solves the electrostatic potential map problem, one must come up with a good high-level decomposition of the problem. As we showed in Section 13.2, one must have a clear understanding of the desirable (e.g., gather in CUDA) and undesirable (e.g., scatter in CUDA) memory access behaviors to make a wise decision.
Given a problem decomposition, parallel programmers face a potentially overwhelming task of designing algorithms to overcome major challenges in parallelism, execution efficiency, and memory bandwidth consumption. There is a very large volume of literature on a wide range of algorithm techniques, which can be hard to digest. It is beyond the scope of this book to provide comprehensive coverage of the available techniques. We did, however, discuss a substantial set of techniques that have broad applicability. While these techniques are presented in the context of CUDA, they help readers build up the foundation for computational thinking in general. We believe that humans understand best when we learn from the bottom up. That is, we first learn the concepts in the context of a particular programming model, which provides us with solid footing before we generalize our knowledge to other programming models. In-depth experience with the CUDA model also gives us the maturity to learn concepts that may not even be pertinent to the CUDA model.
There is a myriad of skills needed for a parallel programmer to be an effective computational thinker. We summarize these foundational skills as follows:
• Computer architecture: memory organization, caching and locality, memory bandwidth, SIMT versus SPMD versus SIMD execution, and floating-point precision versus accuracy. These concepts are critical in understanding the trade-offs between algorithms.
• Programming models and compilers: parallel execution models, types of available memories, array data layout, and thread granularity transformation. These concepts are needed for thinking through the arrangements of data structures and loop structures to achieve better performance.
• Algorithm techniques: tiling, cutoff, scatter–gather, binning, and others. These techniques form the toolbox for designing superior parallel algorithms. Understanding the scalability, efficiency, and memory bandwidth implications of these techniques is essential in computational thinking.
• Domain knowledge: numerical methods, precision, accuracy, and numerical stability. Understanding these ground rules allows a developer to be much more creative in applying algorithm techniques.
Our goal for this book is to provide a solid foundation for all four areas. Readers should continue to broaden their knowledge in these areas after finishing this book. Most importantly, the best way to build up computational thinking skills is to keep solving challenging problems with excellent computational solutions.
In summary, we have discussed the main dimensions of algorithm selection and computational thinking. The key lesson is that given a problem decomposition decision, programmers will typically have to select from a variety of algorithms. Some of these algorithms achieve different trade-offs while maintaining the same numerical accuracy. Others involve sacrificing some level of accuracy to achieve much more scalable running times. The cutoff strategy is perhaps the most popular of such strategies. Even though we introduced cutoff in the context of electrostatic potential map calculation, it is used in many domains including ray tracing in graphics and collision detection in games. Computational thinking skills allow an algorithm designer to work around the roadblocks and reach a good solution.
13.1 Write a host function to perform binning of atoms. Determine the representation of the bins as arrays. Think about coalescing requirements. Make sure that every thread can easily find the bins it needs to process.
13.2 Write the part of the cutoff kernel function that determines if an atom is in the neighborhood of a grid point based on the coordinates of the atoms and the grid points.
1. Message Passing Interface Forum, MPI—A Message Passing Interface Standard Version 2.2, Available at: <http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report.pdf>, 04.09.09.
2. Mattson TG, Sanders BA, Massingill BL. Patterns for Parallel Programming. Reading, MA: Addison-Wesley Professional; 2004.
3. Rodrigues CI, Stone J, Hardy D, Hwu WW. GPU acceleration of cutoff-based potential summation. ACM Computing Frontiers Conference 2008, Italy, May 2008.
4. Wing J. Computational thinking. Communications of the ACM. 2006;49:33–35.
14.1 Background
14.2 Data Parallelism Model
14.3 Device Architecture
14.4 Kernel Functions
14.5 Device Management and Kernel Launch
14.6 Electrostatic Potential Map in OpenCL
14.7 Summary
14.8 Exercises
Now that we have discussed high-performance parallel programming using CUDA C, we would like to introduce another way to exploit the parallel computing capabilities of heterogeneous computing systems with GPUs and CPUs: OpenCLTM. In this chapter, we will give a brief overview of OpenCL for CUDA programmers. The fundamental programming model of OpenCL is so similar to CUDA that there is a one-to-one correspondence for most features. With your understanding of CUDA, you will be able to start writing OpenCL programs with the material presented in this chapter. In our opinion, the best way to learn OpenCL is actually to learn CUDA first and then map the OpenCL features to their CUDA equivalents.
OpenCL is a standardized, cross-platform parallel computing API based on the C language. It is designed to enable the development of portable parallel applications for systems with heterogeneous computing devices. The development of OpenCL was motivated by the need for a standardized high-performance application development platform for the fast-growing variety of parallel computing platforms. In particular, it addresses significant application portability limitations of the previous programming models for heterogeneous parallel computing systems.
CPU-based parallel programming models have typically been based on standards such as OpenMP, but they usually do not encompass the use of special memory types or SIMD (single instruction, multiple data) execution by high-performance programmers. Joint CPU-GPU heterogeneous parallel programming models such as CUDA have constructs that address complex memory hierarchies and SIMD execution, but they have been platform-, vendor-, or hardware-specific. These limitations make it difficult for an application developer to access the computing power of CPUs, GPUs, and other types of processing units from a single multiplatform source code base.
The development of OpenCL was initiated by Apple and managed by the Khronos Group, the same group that manages the OpenGL standard. On one hand, it draws heavily on CUDA in the areas of supporting a single code base for heterogeneous parallel computing, data parallelism, and complex memory hierarchies. This is the reason why a CUDA programmer will find these aspects of OpenCL familiar once we connect the terminologies. Readers will especially appreciate the similarities between OpenCL and the low-level CUDA driver model.
On the other hand, OpenCL has a more complex platform and device management model that reflects its support for multiplatform and multivendor portability. OpenCL implementations already exist on AMD/ATI and NVIDIA GPUs as well as X86 CPUs. In principle, one can envision OpenCL implementations on other types of devices such as digital signal processors (DSPs) and field programmable gate arrays (FPGAs). While the OpenCL standard is designed to support code portability across devices produced by different vendors, such portability does not come for free. OpenCL programs must be prepared to deal with much greater hardware diversity and thus will exhibit more complexity. Also, many OpenCL features are optional and may not be supported on all devices. A portable OpenCL code will need to avoid using these optional features. However, some of these optional features allow applications to achieve significantly more performance in devices that support them. As a result, a portable OpenCL code may not be able to achieve its performance potential on any of the devices. Therefore, one should expect that a portable application that achieves high performance on multiple devices will employ sophisticated runtime tests and choose among multiple code paths according to the capabilities of the actual device used.
The objective of this chapter is not to provide full details on all programming features of OpenCL. Rather, the objective is to give a CUDA programmer a conceptual understanding of the OpenCL programming model features. It also provides some basic host and kernel code patterns for jumpstarting an OpenCL coding project. With this foundation, readers can immediately start to program in OpenCL and consult the OpenCL specification [KHR] and programming guides [NVIDIA, AMD] on an as-needed basis.
OpenCL employs a data-parallel execution model that has direct correspondence with CUDA. An OpenCL program consists of two parts: kernels that execute on one or more OpenCL devices and a host program that manages the execution of kernels. Table 14.1 summarizes the mapping of OpenCL data parallelism concepts to their CUDA equivalents. Like CUDA, the way to submit work for parallel execution in OpenCL is for the host program to launch kernel functions. We will discuss the additional kernel preparation, device selection, and management work that an OpenCL host program needs to do as compared to its CUDA counterpart in Section 14.4.
Table 14.1 Mapping between OpenCL and CUDA Data Parallelism Model Concepts
| OpenCL Parallelism Concept | CUDA Equivalent |
| kernel | kernel |
| host program | host program |
| NDRange (index space) | grid |
| work item | thread |
| work group | block |
When a kernel function is launched, its code is run by work items, which correspond to CUDA threads. An index space defines the work items and how data is mapped to the work items. That is, OpenCL work items are identified by global dimension index ranges (NDRanges). Work items form work groups that correspond to CUDA thread blocks. Work items in the same work group can synchronize with each other using barriers that are equivalent to __syncthreads() in CUDA. Work items in different work groups cannot synchronize with each other except by terminating the kernel function and launching a new one. As we discussed in Chapter 4, this limited scope of barrier synchronization enables transparent scaling.
Figure 14.1 illustrates the OpenCL data-parallel execution model. Readers should compare Figure 14.1 with Figure 12.8 for similarities. The NDRange (CUDA grid) contains all work items (CUDA threads). For this example, we assume that the kernel is launched with a 2D NDRange.
Figure 14.1 Overview of the OpenCL parallel execution model.
All work items have their own unique global index values. There is a minor difference between OpenCL and CUDA in the way they manage these index values. In CUDA, each thread has a blockIdx value and a threadIdx value. The two values are combined to form a global thread ID value for the thread. For example, if a CUDA grid and its blocks are organized as 2D arrays, the kernel code can form a unique global thread index value in the x dimension as blockIdx.x*blockDim.x+threadIdx.x. These blockIdx and threadIdx values are accessible in a CUDA kernel as predefined variables.
In an OpenCL kernel, a thread can get its unique global index values by calling an API function get_global_id() with a parameter that identifies the dimension. See the get_global_id(0) entry in Table 14.2. The functions get_global_id(0) and get_global_id(1) return the global thread index values in the x dimension and the y dimension, respectively. The global index value in the x dimension is equivalent to blockIdx.x*blockDim.x+threadIdx.x in CUDA. See Table 14.2 for the get_local_id(0) function, which is equivalent to threadIdx.x. We did not show the parameter values in Table 14.2 for selecting the higher-dimension indices: 1 for the y dimension and 2 for the z dimension.
Table 14.2 Mapping of OpenCL Dimensions and Indices to CUDA Dimensions and Indices
| OpenCL API Call | Explanation | CUDA Equivalent |
| get_global_id(0) | global index of the work item in the x dimension | blockIdx.x*blockDim.x+threadIdx.x |
| get_local_id(0) | local index of the work item within the work group in the x dimension | threadIdx.x |
| get_global_size(0) | size of the NDRange in the x dimension | gridDim.x*blockDim.x |
| get_local_size(0) | size of each work group in the x dimension | blockDim.x |
An OpenCL kernel can also call an API function get_global_size() with a parameter that identifies the dimensional sizes of its NDRanges. The calls get_global_size(0) and get_global_size(1) return the total number of work items in the x and y dimensions of the NDRanges. Note that this is slightly different from the CUDA gridDim values, which are in terms of blocks. The CUDA equivalent for the get_global_size(0) return value would be gridDim.x*blockDim.x.
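A small OpenCL C sketch of the index mappings in Table 14.2 follows; the kernel name and the output expression are illustrative, and the comment on each line gives the CUDA expression the call corresponds to.

__kernel void index_demo(__global int *out) {
    int gid = get_global_id(0);    // CUDA: blockIdx.x*blockDim.x+threadIdx.x
    int lid = get_local_id(0);     // CUDA: threadIdx.x
    int gsz = get_global_size(0);  // CUDA: gridDim.x*blockDim.x
    int lsz = get_local_size(0);   // CUDA: blockDim.x
    out[gid] = gid + lid * gsz + lsz;  // placeholder use of the four values
}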
Like CUDA, OpenCL models a heterogeneous parallel computing system as a host and one or more OpenCL devices. The host is a traditional CPU that executes the host program. Figure 14.2 shows the conceptual architecture of an OpenCL device. Each device consists of one or more compute units (CUs) that correspond to CUDA streaming multiprocessors (SMs). However, a compute unit can also correspond to CPU cores or other types of execution units in compute accelerators such as DSPs and FPGAs.
Figure 14.2 Conceptual OpenCL device architecture.
Each compute unit, in turn, consists of one or more processing elements (PEs), which correspond to the streaming processors (SPs) in CUDA. Computation on a device ultimately happens in individual PEs.
Like CUDA, OpenCL also exposes a hierarchy of memory types that can be used by programmers. Figure 14.2 illustrates these memory types: global, constant, local, and private. Table 14.3 summarizes the supported use of OpenCL memory types and the mapping of these memory types to CUDA memory types. The OpenCL global memory corresponds to the CUDA global memory. Like CUDA, the global memory can be dynamically allocated by the host program and supports read/write access by both host and devices.
Table 14.3 Mapping of OpenCL Memory Types to CUDA Memory Types
Unlike CUDA, the constant memory can be dynamically allocated by the host. Like CUDA, the constant memory supports read/write access by the host and read-only access by devices. To support multiple platforms, OpenCL provides a device query that returns the constant memory size supported by the device.
The mapping of OpenCL local memory and private memory to CUDA memory types is more interesting. The OpenCL local memory actually corresponds to CUDA shared memory. The OpenCL local memory can be dynamically allocated by the host or statically allocated in the device code. Like CUDA shared memory, OpenCL local memory cannot be accessed by the host, and it supports shared read/write access by all work items in a work group. The private memory of OpenCL corresponds to CUDA automatic variables.
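A short OpenCL C sketch of local memory as the counterpart of CUDA shared memory follows; the kernel name and the work-group size of 64 are assumptions. The CUDA version would declare __shared__ float tile[64]; and call __syncthreads().

__kernel void local_demo(__global const float *in, __global float *out) {
    __local float tile[64];                // CUDA: __shared__ float tile[64];
    int lid = get_local_id(0);
    int gid = get_global_id(0);
    tile[lid] = in[gid];                   // collaborative load by the work group
    barrier(CLK_LOCAL_MEM_FENCE);          // CUDA: __syncthreads();
    // each work item reads a value loaded by another member of its group
    out[gid] = tile[get_local_size(0) - 1 - lid];
}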
OpenCL kernels have the same basic structure as CUDA kernels. All OpenCL kernel declarations start with the __kernel keyword, which is equivalent to the __global__ keyword in CUDA. Figure 14.3 shows a simple OpenCL kernel that performs vector addition.
Figure 14.3 A simple OpenCL kernel example.
The function takes three arguments: pointers to the two input arrays and one pointer to the output array. The __global declarations in the function header indicate that the input and output arrays all reside in the global memory. Note that this keyword has the same meaning in OpenCL as in CUDA, except that there are two underscore characters (__) after the global keyword in CUDA.
The body of the kernel function is instantiated once for each work item. In Figure 14.3, each work item calls the get_global_id(0) function to obtain its unique global index. The work item then uses this index value to select the array elements to work on. Once the array element index i is formed, the rest of the kernel is virtually identical to the CUDA kernel.
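A vector addition kernel in the style of Figure 14.3 is sketched below (not necessarily the book's exact listing).

__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *c) {
    int i = get_global_id(0);   // unique global index of this work item
    c[i] = a[i] + b[i];         // from here on, identical to the CUDA kernel
}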
OpenCL defines a much more complex model of device management than CUDA. The extra complexity stems from the OpenCL support for multiple hardware platforms. OpenCL supports runtime construction and compilation of kernels to maximize an application’s ability to address portability challenges across a wide range of CPUs and GPUs. Interested readers should refer to the OpenCL specification for more insight into the work that went into the OpenCL specification to cover as many types of potential OpenCL devices as possible [KHR2011].
In OpenCL, devices are managed through contexts. Figure 14.4 illustrates the main concepts of device management in OpenCL. To manage one or more devices in the system, the OpenCL programmer first creates a context that contains these devices. A context is essentially an address space containing the memory locations accessible to the OpenCL devices in the system. This can be done by calling either clCreateContext() or clCreateContextFromType() in the OpenCL API.
Figure 14.4 An OpenCL context is needed to manage devices.
Figure 14.5 shows a simple host code pattern for managing OpenCL devices. In line 4, we use clGetContextInfo() to get the number of bytes needed (parmsz) to hold the device information; this is used in line 5 to allocate enough memory to hold the information about all the devices available in the system. This is necessary because the amount of memory needed depends on the number of OpenCL devices in the system. We then call clGetContextInfo() again in line 6, with the size of the device information and a pointer to the allocated memory, so that the function can deposit the information on all the devices in the system into that memory. An application could also use the clGetDeviceIDs() API function to determine the number and types of devices that exist in a system. Readers should consult the OpenCL Programming Guide for the details of the parameters used by these functions [Khronos].
Figure 14.5 Creating OpenCL context and command queue.
To submit work for execution by a device, the host program must first create a command queue for the device. This can be done by calling the clCreateCommandQueue() function in the OpenCL API. Once a command queue is created for a device, the host code can perform a sequence of API function calls to insert a kernel, along with its execution configuration parameters, into the command queue. When the device is available for executing the next kernel, it removes the kernel at the head of the queue for execution.
Figure 14.5 shows a simple host program that creates a context for a device and submits a kernel for execution by the device. Line 2 shows a call to create a context that includes all OpenCL available devices in the system. Line 4 calls the clGetContextInfo() function to inquire about the number of devices in the context. Since line 2 asks that all OpenCL available devices be included in the context, the application does not know the number of devices actually included in the context after the context is created. The second argument of the call in line 4 specifies that the information being requested is the list of all devices included in the context. However, the fourth argument, which is a pointer to a memory buffer where the list should be deposited, is a NULL pointer. This means that the call does not want the list itself. The reason is that the application does not know the number of devices in the context and does not know the size of the memory buffer required to hold the list.
Rather, line 4 provides a pointer to the variable parmsz. After line 4, the parmsz variable holds the size of the buffer needed to accommodate the list of devices in the context. The application now knows the amount of memory buffer needed to hold the list of devices in the context. It allocates the memory buffer using parmsz and assigns the address of the buffer to the pointer variable cldevs at line 5.
Line 6 calls clGetContextInfo() again with the pointer to the memory buffer in the fourth argument and the size of the buffer in the third argument. Since this is based on the information from the call at line 4, the buffer is guaranteed to be the right size for the list of devices to be returned. The clGetContextInfo function now fills the device list information into the memory buffer pointed to by cldevs.
Line 7 creates a command queue for the first OpenCL device in the list. This is done by treating cldevs as an array whose elements are descriptors of the OpenCL devices in the system. Line 7 passes cldevs[0] as the second argument into the clCreateCommandQueue() function. Therefore, the call generates a command queue for the first device in the list returned by the clGetContextInfo() function.
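The following C sketch follows the same two-call clGetContextInfo() pattern just described (error checking omitted; the function name make_queue is illustrative).

#include <CL/cl.h>
#include <stdlib.h>

cl_command_queue make_queue(void) {
    cl_int err;
    /* line 2: context with all available devices (properties omitted) */
    cl_context ctx = clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL,
                                             NULL, NULL, &err);
    size_t parmsz;
    /* line 4: first call only asks how many bytes the device list needs */
    clGetContextInfo(ctx, CL_CONTEXT_DEVICES, 0, NULL, &parmsz);
    /* line 5: allocate a buffer of exactly that size */
    cl_device_id *cldevs = (cl_device_id *)malloc(parmsz);
    /* line 6: second call deposits the actual device list */
    clGetContextInfo(ctx, CL_CONTEXT_DEVICES, parmsz, cldevs, NULL);
    /* line 7: command queue for the first device in the list */
    cl_command_queue q = clCreateCommandQueue(ctx, cldevs[0], 0, &err);
    return q;
}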
Readers may wonder why we did not need to see this complex sequence of API calls in our CUDA host programs. The reason is that we have been using the CUDA runtime API that hides all this complexity for the common case where there is only one CUDA device in the system. The kernel launch in CUDA handles all the complexities on behalf of the host code. If the developer wanted to have direct access to all CUDA devices in the system, he or she would need to use the CUDA driver API, where similar API calling sequences would be used. To date, OpenCL has not defined a higher-level API that is equivalent to the CUDA runtime API. Until such a higher-level interface is available, OpenCL will remain much more tedious to use than the CUDA runtime API. The benefit, of course, is that an OpenCL application can execute on a wide range of devices.
We now present an OpenCL case study based on the DCS kernel in Figure 12.9. This case study is designed to give a CUDA programmer a practical, top-to-bottom experience with OpenCL. The first step in porting the kernel to OpenCL is to design the organization of the NDRange, which is illustrated in Figure 14.6. The design is a straightforward mapping of CUDA threads to OpenCL work items and CUDA blocks to OpenCL work groups. As shown in Figure 14.6, each work item will calculate up to eight grid points, and each work group will have 64 to 256 work items. All the efficiency considerations in Chapter 12 also apply here.
Figure 14.6 DCS kernel version 3 NDRange configuration.
The work groups are assigned to the CUs the same way that CUDA blocks are assigned to the SMs. Such assignment is illustrated in Figure 14.7. One can use the same methodology used in Chapters 6 and 12 to derive a high-performance OpenCL DCS kernel. Although the syntax is different, the underlying thought process involved in developing a high-performance OpenCL kernel is very much the same as in CUDA.
Figure 14.7 Mapping DCS NDRange to OpenCL device.
The OpenCL kernel function implementation matches the CUDA implementation closely. Figure 14.8 shows the key differences. One is the __kernel keyword in OpenCL versus the __global__ keyword in CUDA. The main difference lies in the way the data access indices are calculated. In this case, the OpenCL get_global_id(0) function returns the equivalent of the CUDA expression blockIdx.x*blockDim.x+threadIdx.x.
Figure 14.8 Data access indexing in OpenCL and CUDA.
Figure 14.9 shows the inner loop of the OpenCL kernel. Readers should compare this inner loop with the CUDA code in Figure 12.9. The only difference is that the __rsqrt() call has been changed to the native_rsqrt() call, the OpenCL way for using the hardware implementation of math functions on a particular device.
Figure 14.9 Inner loop of the OpenCL DCS kernel.
OpenCL adopts a dynamic compilation model. Unlike CUDA, the host program can explicitly compile and create a kernel program at runtime. This is illustrated in Figure 14.10 for the DCS kernel. Line 1 declares the entire OpenCL DCS kernel source code as a string. Line 3 delivers the source code string to the OpenCL runtime system by calling the clCreateProgramWithSource() function. Line 4 sets up the compiler flags for the runtime compilation process. Line 5 invokes the runtime compiler to build the program. Line 6 requests that the OpenCL runtime create the kernel and its data structures so that it can be properly launched. After line 6, clkern points to the kernel that can be submitted to a command queue for execution.
Figure 14.10 Building OpenCL kernel.
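A C sketch of this runtime compilation sequence follows; the stand-in kernel source, the kernel name "clenergy", and the compiler flag are illustrative assumptions, not the book's actual DCS listing.

#include <CL/cl.h>

cl_kernel build_dcs_kernel(cl_context ctx, cl_device_id dev) {
    cl_int err;
    /* line 1: the entire kernel source as a string (stand-in body here) */
    const char *src =
        "__kernel void clenergy(__global float *out) {"
        "  out[get_global_id(0)] = 0.0f; }";
    /* line 3: hand the source string to the OpenCL runtime */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    /* lines 4-5: set compiler flags and invoke the runtime compiler */
    const char *flags = "-cl-fast-relaxed-math";
    clBuildProgram(prog, 1, &dev, flags, NULL, NULL);
    /* line 6: create the kernel object that can be enqueued for execution */
    cl_kernel clkern = clCreateKernel(prog, "clenergy", &err);
    return clkern;
}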
Figure 14.11 shows the host code that launches the DCS kernel. It assumes that the host code for managing OpenCL devices in Figure 14.5 has been executed. Lines 1 and 2 allocate memory for the energy grid data and the atom information. The clCreateBuffer() function corresponds to the cudaMalloc() function. The constant memory is implicitly requested by setting the mode of access to read only for the atominfo array. Note that each memory buffer is associated with a context, which is specified by the first argument to the clCreateBuffer() function call.
Figure 14.11 OpenCL host code for kernel launch and
Lines 3–6 in Figure 14.11 set up the arguments to be passed into the kernel function. In CUDA, the kernel functions are launched with C function call syntax extended with <<<>>>, which is followed by the regular list of arguments. In OpenCL, there is no explicit call to kernel functions. Therefore, one needs to use the clSetKernelArg() functions to set up the arguments for the kernel function.
Line 8 in Figure 14.11 submits the DCS kernel for launch. The arguments to the clEnqueueNDRangeKernel() function specify the command queue of the device that will execute the kernel, a pointer to the kernel, and the global and local sizes of the NDRange. Lines 9 and 10 check for errors, if any.
Line 11 transfers the contents of the output data back into the energy array in the host memory. The OpenCL clEnqueueReadBuffer() copies data from the device memory to the host memory and corresponds to the device-to-host direction of the cudaMemcpy() function.
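A C sketch of the buffer setup, argument binding, launch, and readback sequence just described follows; buffer sizes, the kernel argument order, and the NDRange sizes are illustrative assumptions.

#include <CL/cl.h>

void run_dcs(cl_context ctx, cl_command_queue q, cl_kernel clkern,
             float *energy, size_t volmemsz, int natoms) {
    cl_int err;
    /* lines 1-2: device buffers; read-only mode hints constant-like use */
    cl_mem doutput = clCreateBuffer(ctx, CL_MEM_READ_WRITE, volmemsz, NULL, &err);
    cl_mem datominfo = clCreateBuffer(ctx, CL_MEM_READ_ONLY,
                                      natoms * sizeof(cl_float4), NULL, &err);
    /* lines 3-6: no <<<>>> in OpenCL; arguments are bound one by one */
    clSetKernelArg(clkern, 0, sizeof(int), &natoms);
    clSetKernelArg(clkern, 1, sizeof(cl_mem), &doutput);
    clSetKernelArg(clkern, 2, sizeof(cl_mem), &datominfo);
    /* line 8: enqueue with global and local NDRange sizes (2D here) */
    size_t gsz[2] = {4096, 4096}, lsz[2] = {16, 16};
    err = clEnqueueNDRangeKernel(q, clkern, 2, NULL, gsz, lsz, 0, NULL, NULL);
    /* lines 9-10: error check */
    if (err != CL_SUCCESS) return;
    /* line 11: blocking device-to-host copy of the result grid */
    clEnqueueReadBuffer(q, doutput, CL_TRUE, 0, volmemsz, energy, 0, NULL, NULL);
    /* decrement the reference counts of the data objects */
    clReleaseMemObject(doutput);
    clReleaseMemObject(datominfo);
}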
The clReleaseMemObject() function is a little more sophisticated than cudaFree(). OpenCL maintains a reference count for data objects. OpenCL host program modules can retain (clRetainMemObject()) and release (clReleaseMemObject()) data objects. Note that clCreateBuffer() also serves as a retain call. With each retain call, the reference count of the object is incremented. With each release call, the reference count is decremented. When the reference count for an object reaches 0, the object is freed. This way, a library module can “hang on” to a memory object even though the other parts of the application no longer need the object and thus have released the object.
OpenCL is a standardized, cross-platform API designed to support portable parallel application development on heterogeneous computing systems. Like CUDA, OpenCL addresses complex memory hierarchies and data-parallel execution. It draws heavily on the CUDA driver API experience. This is the reason why a CUDA programmer finds these aspects of OpenCL familiar. We have seen this through the mappings of the OpenCL data parallelism model concepts, NDRange API calls, and memory types to their CUDA equivalents.
On the other hand, OpenCL has a more complex device management model that reflects its support for multiplatform and multivendor portability. While the OpenCL standard is designed to support code portability across devices produced by different vendors, such portability does not come for free. OpenCL programs must be prepared to deal with much greater hardware diversity and thus will exhibit more complexity. We see that the OpenCL device management model, the OpenCL kernel compilation model, and the OpenCL kernel launch are much more complex than their CUDA counterparts.
We have by no means covered all the programming features of OpenCL. Readers are encouraged to read the OpenCL specification [KHR2011] and tutorials [Khronos] for more OpenCL features. In particular, we recommend that readers pay special attention to the device query, object query, and task parallelism model.
14.1 Use the code base in Appendix A and examples in Chapters 3, 4, 5, and 6 to develop an OpenCL version of the matrix–matrix multiplication application.
14.2 Read the “OpenCL Platform Layer” section of the OpenCL specification. Compare the platform querying API functions with what you have learned in CUDA.
14.3 Read the “Memory Objects” section of the OpenCL specification. Compare the object creation and access API functions with what you have learned in CUDA.
14.4 Read the “Kernel Objects” section of the OpenCL specification. Compare the kernel creation and launching API functions with what you have learned in CUDA.
14.5 Read the “OpenCL Programming Language” section of the OpenCL specification. Compare the keywords and types with what you have learned in CUDA.
1. AMD OpenCL Resources. Available at: <http://developer.amd.com/gpu/ATIStreamSDK/pages/TutorialOpenCL.aspx>.
2. Khronos Group, The OpenCL Specification version 1.1, rev44. Available at: <http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf>, 2011.
3. Khronos OpenCL samples, tutorials, etc. Available at: <http://www.khronos.org/developers/resources/opencl/>.
4. NVIDIA OpenCL Resources. Available at: <http://www.nvidia.com/object/cuda_opencl.html>.
15.1 OpenACC Versus CUDA C
15.2 Execution Model
15.3 Memory Model
15.4 Basic OpenACC Programs
15.5 Future Directions of OpenACC
15.6 Exercises
The OpenACC Application Programming Interface (API) provides a set of compiler directives, library routines, and environment variables that can be used to write data-parallel FORTRAN, C, and C++ programs that run on accelerator devices, including GPUs. It is an extension to the host language. The OpenACC specification was initially developed by the Portland Group (PGI), Cray Inc., and NVIDIA, with support from CAPS enterprise. This chapter presents an introduction to OpenACC for parallel programmers who are already familiar with CUDA C.
One big difference between OpenACC and CUDA C is the use of compiler directives in OpenACC. To understand what a compiler directive is and the advantages of using compiler directives, let’s take a look at our first OpenACC program in Figure 15.1, which does the matrix multiplication that we’ve already seen before.
Figure 15.1 Our first OpenACC program.
The code in the figure is almost identical to the sequential version, except for the two #pragma lines at lines 4 and 6. In C and C++, the #pragma directive is the mechanism for providing the compiler with information that is not specified in the standard language. OpenACC uses the compiler directive mechanism to extend the base language. In this example, the #pragma at line 4 tells the compiler to generate code for the i loop at lines 5-16 so that the loop iterations are executed in parallel on the accelerator. The copyin clause and the copyout clause specify how the matrix data should be transferred between the host and the accelerator. The #pragma at line 6 instructs the compiler to map the inner j loop to the second level of parallelism on the accelerator.
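As a concrete illustration of these directives, here is a sketch in the spirit of Figure 15.1 (not necessarily the book's exact listing); the row-major array layout and the function name are assumptions.

void matmul(int n, const float *A, const float *B, float *C) {
    /* line 4: run the i loop in parallel on the accelerator; the data
       clauses manage the host-accelerator transfers automatically */
    #pragma acc parallel loop copyin(A[0:n*n], B[0:n*n]) copyout(C[0:n*n])
    for (int i = 0; i < n; i++) {
        /* line 6: map the j loop to a second level of parallelism */
        #pragma acc loop
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
    }
}

If a compiler ignores the pragmas, the same code compiles and runs as the ordinary sequential matrix multiplication, which is exactly the property discussed below.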
Compared with CUDA C/C++/FORTRAN, by using compiler directives, OpenACC brings quite a few benefits to programmers:
• OpenACC programmers can often start by writing a sequential version and then annotating their sequential program with OpenACC directives. They can leave most of the heavy lifting to the OpenACC compiler. The details of data transfer between host and accelerator memories, data caching, kernel launching, thread scheduling, and parallelism mapping are all handled by the OpenACC compiler and runtime. The entry barrier for programming accelerators in heterogeneous systems becomes much lower with OpenACC.
• OpenACC provides an incremental path for moving legacy applications to accelerators. This is attractive because adding directives disturbs the existing code less than other approaches do. Some existing scientific applications are large, and their developers do not want to rewrite them for accelerators. OpenACC lets these developers keep their applications looking like normal C, C++, or FORTRAN code, and they can put the directives into the code where they are needed, one place at a time.
• A non-OpenACC compiler is not required to understand and process OpenACC directives; it can simply ignore the directives and compile the rest of the program as usual. By using the compiler directive approach, OpenACC allows a programmer to write OpenACC programs in such a way that when the directives are ignored, the program still runs sequentially and gives the same result as when it runs in parallel. Parallel programs that have equivalent sequential versions are much easier to debug than those that don't. The matrix multiplication code in Figure 15.1 has this property: the code gives the same result regardless of whether lines 4 and 6 are honored. Such programs essentially contain both the parallel version and the sequential version in one. OpenACC permits a common code base for accelerated and nonaccelerated systems.
OpenACC users need to be aware of the following issues:
• Some OpenACC directives are hints to the OpenACC compiler, which may or may not be able to take full advantage of such hints. Therefore, the performance of an OpenACC program depends more on the capability of the OpenACC compiler used. On the other hand, a CUDA C/C++/FORTRAN program expresses parallelism explicitly and relies less on the compiler for parallel performance.
• While it is possible to write OpenACC programs that give the same execution result as when the directives are ignored, this property does not hold automatically for all OpenACC programs. If compiler directives are ignored, some OpenACC programs may give different results or some may not work correctly.
In the rest of this chapter, we first explain the execution model and memory model used by OpenACC. We then walk through some concrete code examples to illustrate the usage of some of the more commonly used OpenACC directives and APIs. We also show how an OpenACC implementation can map parallel regions and kernel regions to the CUDA GPU architecture. We believe certain behind-the-scenes knowledge can help users get better performance out of OpenACC implementations. We conclude this chapter by outlining the future directions we see OpenACC going in.
The OpenACC target machine has a host and an attached accelerator device, such as a GPU. Most accelerator devices can support multiple levels of parallelism. Figure 15.2 illustrates a typical accelerator that supports three levels of parallelism. At the outermost coarse-grain level, there are multiple execution units. Within each execution unit, there are multiple threads. At the innermost level, each thread is capable of executing vector operations. Currently, OpenACC does not assume any synchronization capability on the accelerator, except for thread forking and joining. Once work is distributed among the execution units, they will execute in parallel from start to finish. Similarly, once work is distributed among the threads within an execution unit, the threads execute in parallel. Vector operations are executed in lockstep.
Figure 15.2 A typical accelerator device
An OpenACC program starts its execution on the host single-threaded (Figure 15.3). When the host thread encounters a parallel or a kernels construct, a parallel region or a kernels region that comprises all the code enclosed in the construct is created and launched on the accelerator device. The parallel region or kernels region can optionally execute asynchronously with the host thread and join with the host thread at a future synchronization point. The parallel region is executed entirely on the accelerator device. The kernels region may contain a sequence of kernels, each of which is executed on the accelerator device.
Figure 15.3 OpenACC execution model.
Kernel execution follows a fork-join model. A group of gangs is used to execute each kernel. Within a gang, a group of workers can be forked to execute a parallel work-sharing loop, and the workers are disbanded when the loop is done. Typically a gang executes on one execution unit, and a worker runs on one thread within an execution unit.
The programmer can specify how the work within a parallel region or a kernels region is to be distributed among the different levels of parallelism on the accelerator.
In the OpenACC memory model, the host memory and the device memory are treated as separate. It is assumed that the host cannot access device memory directly and that the device cannot access host memory directly. This ensures that the OpenACC programming model can support a wide range of accelerator devices, including most current GPUs, which do not have unified memory access between GPUs and CPUs. The unified virtual addressing and GPUDirect features introduced by NVIDIA in CUDA 4.0 allow a single virtual address space for both host memory and device memory and allow direct cross-device memory access between different GPUs. However, direct memory access across the host and the device is still not possible.
Just like in CUDA C/C++, in OpenACC input data needs to be transferred from the host to the device before kernel launches and result data needs to be transferred back from the device to the host. However, unlike in CUDA C/C++ where programmers need to explicitly code data movement through API calls, in OpenACC they can just annotate which memory objects need to be transferred, as shown by line 4 in Figure 15.1. The OpenACC compiler will automatically generate code for memory allocation, copying, and de-allocation.
OpenACC adopts a fairly weak consistency model for memory on the accelerator device. Although data on the accelerator can be shared by all execution units, OpenACC does not provide a reliable way for one execution unit to consume data produced by another execution unit. There are two reasons for this. First, recall that OpenACC does not provide any mechanism for synchronization between execution units. Second, memories between different execution units are not coherent. Although some hardware provides instructions to explicitly invalidate and update caches, they are not exposed at the OpenACC level. Therefore, in OpenACC, different execution units are expected to work on disjoint memory sets. Threads within an execution unit can also share memory, and those threads do have coherent memory. However, OpenACC currently mandates a memory fence only at thread fork and join, which are also the only synchronizations OpenACC provides for threads. While the device memory model may appear very limiting, it is not so in practice. For data-race-free OpenACC data-parallel applications, the weak memory model works quite well.
In this section, we will dive into the details of how one can write basic OpenACC programs.
The single #pragma at line 4 in Figure 15.1 is actually syntactic sugar for two #pragmas (Figure 15.4). We explain the parallel construct here and will explain the loop construct in the next section.
Figure 15.4 #pragma acc parallel loop is #pragma acc parallel and #pragma acc loop in one.
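The combined directive and its expansion can be sketched as follows, using a hypothetical vector-addition loop in place of the matrix multiplication code:

// Combined form:
#pragma acc parallel loop
for (int i = 0; i < n; i++) c[i] = a[i] + b[i];

// Equivalent expanded form:
#pragma acc parallel
{
  #pragma acc loop
  for (int i = 0; i < n; i++) c[i] = a[i] + b[i];
}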
The parallel construct is one of the two constructs (the other is the kernels construct) that can be used to specify which part of the program is to be executed on the accelerator. When a program encounters a parallel construct, the execution of the code within the structured block of the construct (also called a parallel region) is moved to the accelerator. Gangs of workers on the accelerator are created to execute the parallel region, as shown in Figure 15.3. Initially only one worker (let us call it a gang lead) within each gang will execute the parallel region. The other workers are conceptually idle at this point. They will be deployed when there is more parallel work at an inner level. The number of gangs can be specified by the num_gangs clause, and the number of workers within each gang can be specified by the num_workers clause.
In this example (Figure 15.5), a total of 1,024×32=32,768 workers are created. The a=23 statement will be executed in parallel and redundantly by 1,024 gang leads. You may ask why anyone would want to write accelerator code like this. The usefulness of the parallel construct will become clear when it is used with the loop construct.
Figure 15.5 Redundant execution in a parallel region.
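A minimal sketch of such a region, assuming an integer variable a, is:

#pragma acc parallel num_gangs(1024) num_workers(32)
{
  a = 23;   // executed redundantly by each of the 1,024 gang leads
}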
If a parallel construct does not have an explicit num_gangs clause or an explicit num_workers clause, then the implementation will pick the numbers at runtime. Once a parallel region starts executing, the number of gangs and the number of workers within each gang remain fixed during the execution of the parallel region. This is similar to CUDA in which once a kernel is launched, the number of blocks in the grid and the number of threads in a block cannot be changed.
Suppose you have a loop where all the iterations can be independently executed in parallel and you want to speed up its execution by running it on the accelerator. Can you write the code like Figure 15.6?
Figure 15.6 An unannotated loop in a parallel region is also redundantly executed.
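A sketch of this pattern, with foo() as a hypothetical loop body, is:

#pragma acc parallel num_gangs(1024)
{
  for (int i = 0; i < 2048; i++) {   // no loop directive on this loop
    foo(i);
  }
}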
As we learned from the previous section, although the loop will be executed on the accelerator, you won’t get any speedup because all 2,048 iterations will be executed sequentially and redundantly by the 1,024 gang leads. To get speedup, you need to distribute the 2,048 iterations among the gangs. And to do that, you need to use the gang loop construct, as shown in Figure 15.7.
Figure 15.7 Use the loop construct to make a loop work-shared.
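Compared with the previous sketch, the only change needed is a loop directive with the gang clause (again with foo() as a hypothetical loop body):

#pragma acc parallel num_gangs(1024)
{
  #pragma acc loop gang
  for (int i = 0; i < 2048; i++) {
    foo(i);
  }
}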
A gang loop construct is always associated with a loop. The gang loop construct is a work-sharing construct. The compiler and runtime will make sure that the iterations of a gang loop are shared among all the gang leads that encounter the loop construct. In Figure 15.7, because 1,024 gang leads will encounter the loop construct, each lead will be assigned two iterations. Now the execution of the parallel loop will be more efficient and likely to achieve speedup.
What if you also have an inner loop that can be executed in parallel? Well, that is when the worker loop construct can be beneficial.
The worker loop construct is also a work-sharing construct. The compiler and runtime will make sure that the iterations of a worker loop are shared among all workers within a gang. In Figure 15.8, the 32 workers in a gang will work collectively on the 512 iterations of the j loop in each of the two iterations of the i loop assigned to the gang. A total of 2,048×512=1 M instances of foo() will be executed in the sequential version or the parallel version. In the parallel version, 1,024×32=32 K workers are used and each worker will execute 1 M÷32 K=32 instances of foo().
Figure 15.8 Using the worker clause.
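A sketch of this doubly nested pattern is:

#pragma acc parallel num_gangs(1024) num_workers(32)
{
  #pragma acc loop gang
  for (int i = 0; i < 2048; i++) {    // two iterations per gang
    #pragma acc loop worker
    for (int j = 0; j < 512; j++) {   // shared by the 32 workers of a gang
      foo(i, j);
    }
  }
}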
Readers who are familiar with CUDA C may ask how the OpenACC code in Figure 15.8 differs from the CUDA C code in Figure 15.9. They may wonder, can't I just write the CUDA C version and achieve the same effect?
Figure 15.9 A possible CUDA C implementation of the parallel region in Figure 15.8.
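One possible mapping, sketched under the assumption that each gang becomes a thread block and each worker becomes a thread, is:

__global__ void acc_region(void) {
  // gang loop: iterations distributed across the 1,024 blocks
  for (int i = blockIdx.x; i < 2048; i += gridDim.x) {
    // worker loop: iterations distributed across the 32 threads of a block
    for (int j = threadIdx.x; j < 512; j += blockDim.x) {
      foo(i, j);   // hypothetical __device__ function
    }
  }
}
// launched as: acc_region<<<1024, 32>>>();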
Yes, they are similar in this case. As a matter of fact, some OpenACC implementations may actually generate the CUDA C version in Figure 15.9 from the OpenACC version in Figure 15.8 and pass it to the CUDA C compiler. But one clear advantage of the OpenACC version is that it is much closer to the sequential version than the CUDA C version is; only a few code modifications are required. Compared with CUDA C, OpenACC gives you less control over what the final code on the accelerator will look like. However, the strength of OpenACC lies in its ability to tackle more complicated existing sequential code, especially when the original code you want to port to the accelerator is not a perfectly nested loop nest.
Take the code snippet in Figure 15.10 for example. Let’s assume both loops are parallel loops. If you want to move the whole code snippet to execute on the accelerator, it is much easier with OpenACC. If the code can give you the same result when statements 1, 2, 5, 6, 9, and 10 are executed redundantly by multiple gang leaders, then you can do what is shown in Figure 15.11.
Figure 15.10 A piece of nontrivial code.
Figure 15.11 Porting is easier with OpenACC (Part 1).
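A sketch of the annotated snippet, with statement1 through statement10 and the loop bounds standing in for the original code in Figure 15.10, is:

#pragma acc parallel num_gangs(32)
{
  statement1; statement2;            // executed redundantly by each gang
  #pragma acc loop gang worker
  for (int i = 0; i < n; i++) {
    statement3; statement4;
  }
  statement5; statement6;            // executed redundantly by each gang
  #pragma acc loop gang worker
  for (int i = 0; i < m; i++) {
    statement7; statement8;
  }
  statement9; statement10;           // executed redundantly by each gang
}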
The first pragma in Figure 15.11 creates 32 gangs. Statements 1 and 2 are executed by all gangs. Note that in the original code, these statements are executed only once. However, after the annotation, the compiler will generate code that executes these statements 32 times. This is equivalent to moving a statement into a loop. As long as the statement can be executed extra times without producing incorrect results, this is not a problem.
The second pragma in Figure 15.11 assigns the work of the for loop to the 32 gangs. Each gang will further distribute its share of the work to multiple workers. The exact number of workers in each gang will likely be decided at runtime when the number of iterations and the number of execution units are known.
If statements 1, 2, 5, 6, 9, and 10 can only be executed once, then you can still make the annotations shown in Figure 15.12. In this case, only one gang with 32 workers will be created. The gang lead will execute statements 1, 2, 5, 6, 9, and 10, and it will assign the work of the two for loops to its 32 workers. Obviously, the total number of workers will be much lower than in the previous case, which employs 32 gangs, each with multiple workers.
Figure 15.12 Porting is easier with OpenACC (Part 2).
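A sketch of this second annotation is:

#pragma acc parallel num_gangs(1) num_workers(32)
{
  statement1; statement2;            // executed once, by the single gang lead
  #pragma acc loop worker
  for (int i = 0; i < n; i++) {
    statement3; statement4;
  }
  statement5; statement6;
  #pragma acc loop worker
  for (int i = 0; i < m; i++) {
    statement7; statement8;
  }
  statement9; statement10;
}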
The important point here is that to achieve the same effect with CUDA, more significant code changes are required between the two cases. In the first case, statements 1 and 2 need to be pushed into the loop so that a kernel can be formed with statements 1, 2, 3, and 4. Similarly, another kernel needs to be formed with statements 5, 6, 7, 8, 9, and 10. In the second case, statements 1, 2, 5, 6, 9, and 10 will remain as host code, whereas statements 3 and 4 will form one kernel and statements 7 and 8 will form a second kernel. We leave the detailed implementation of the kernels in both cases as exercises.
Recall that OpenACC was designed to support multiple levels of parallelism found in a typical accelerator. The vector clause on a loop construct is often used to express the innermost vector or SIMD (single instruction, multiple data) mode loop in an accelerator region, as illustrated in Figure 15.13.
Figure 15.13 Using the vector clause.
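A sketch of a triply nested region that uses all three levels, with foo() as a hypothetical loop body, is:

#pragma acc parallel num_gangs(1024) num_workers(32) vector_length(32)
{
  #pragma acc loop gang
  for (int i = 0; i < 2048; i++) {
    #pragma acc loop worker
    for (int j = 0; j < 512; j++) {
      #pragma acc loop vector
      for (int k = 0; k < 32; k++) {
        foo(i, j, k);
      }
    }
  }
}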
On a GPU, a possible implementation is to map a gang to a CUDA block, a worker to a CUDA warp, and a vector element to a thread within a warp. However, this is not mandated by the OpenACC specification and an implementation (compiler/runtime) may choose a different mapping based on the code pattern within an accelerator region for best performance.
Like the parallel construct, the kernels construct also allows a programmer to specify which part of a program he or she wants to be executed on an accelerator. And a loop construct can be used inside a kernels construct. One major difference between the two is that a kernels region may be broken into a sequence of kernels, each of which will be executed on the accelerator, while the whole parallel region will become a kernel and be executed on the accelerator. Typically, each loop nest in a kernels construct may become a kernel, as illustrated in Figure 15.14.
Figure 15.14 A kernels region may be broken into a sequence of kernels.
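A sketch of such a region, with the arrays and gang counts as hypothetical stand-ins except for the d[k] = c[k] statement discussed below, is:

#pragma acc kernels
{
  #pragma acc loop gang(1024)
  for (int i = 0; i < 2048; i++) a[i] = b[i];

  #pragma acc loop gang(512)
  for (int j = 0; j < 2048; j++) c[j] = a[j] + 1;

  for (int k = 0; k < 2048; k++) d[k] = c[k];
}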
In the figure, the kernels region may be broken into three kernels, one for each loop, and they will be executed on the accelerator in order. It is also possible that some implementations may decide not to generate a kernel for the k loop, in which case this kernels region will contain two kernels, one each for the i loop and the j loop, while the k loop is executed on the host.
A kernels region may contain multiple kernels and each may use a different number of gangs, a different number of workers, and different vector lengths. Therefore, there is no num_gangs, num_workers, or vector_length clause on the kernels construct. You can specify them on the enclosed loop construct if you want to, as illustrated in Figure 15.14.
Now let’s take a look at another major difference between the parallel construct and the kernels construct. In Figure 15.14, how many times will the k loop be executed? Previously we’ve learned that the non-work-sharing code inside a parallel construct will be executed redundantly by the gang leads (see Figures 15.5 and 15.6). If the k loop in Figure 15.14 were inside a parallel construct, then the statement d[k] = c[k] is executed 2,048 times the number of gangs. This is different in a kernels construct, in which case it is just 2,048 times.
The parallel construct and the kernels construct were designed from two different perspectives. The kernels construct is more descriptive: it describes the intention of the programmer, and the compiler is responsible for mapping and partitioning the program onto the underlying hardware. Notice that we use the word "may" when we explain the kernels code in Figure 15.14. It is possible that an OpenACC-compliant compiler decides not to generate any kernels at all for the kernels region in Figure 15.14. The loop constructs used for the i and j loops tell the compiler to generate code in which the loop iterations are shared among the gang leads, but only if the compiler decides to generate kernels for these loops.

There are two common reasons why a compiler decides not to generate a kernel for a loop construct. One reason is safety. The compiler checks whether parallelizing the loop will give the same execution result as the sequential version. A series of analyses is performed on the loop and the rest of the program. If the compiler finds that it is not safe to parallelize the loop, or cannot decide whether it is safe due to lack of information, it will not parallelize the loop and hence will not generate a kernel for the loop construct. The other reason is performance. The ultimate goal of using OpenACC directives is to get speedup; the compiler may decide not to parallelize and execute a loop on the accelerator if it finds that doing so would only slow down the program.

Since the compiler mostly takes care of the parallelization issues, the descriptive approach makes porting programs to OpenACC relatively easy. The downside is that the quality of the generated accelerated code depends significantly on the capability of the compiler used. A high-quality compiler is expected to give feedback to the programmer on how it compiles kernels constructs and why it does not parallelize certain loops. With this information, the programmer can verify whether his or her intention is achieved and may provide more hints to the compiler to achieve his or her goal. In the next section, we will show a few ways to help an OpenACC compiler.
The parallel construct is more prescriptive: the compiler does what the programmer instructs it to do. The programmer ultimately has more control over where kernels are generated and how loops are parallelized and scheduled. Different OpenACC compilers should perform similar transformations on a parallel construct. The downside is that there is no safety net. If a loop has data dependences between different iterations and is unsafe to parallelize, the programmer should not put such a serial loop inside a loop construct. This is the same philosophy taken by OpenMP, another successful directive-based approach to parallel programming. Programmers who are familiar with OpenMP should feel comfortable using parallel constructs.
To parallelize a loop inside a kernels region, an OpenACC compiler generally needs to prove that there is no cross-iteration data dependence in the loop. There is no data dependence in the i loop in Figure 15.15. All iterations can be executed in parallel and give the same result as when the iterations are executed sequentially. An OpenACC compiler should have no trouble deciding that the i loop is parallelizable. In the j loop, each iteration uses the value of an array element of a[] defined in the previous iteration; therefore, the result will be different if the loop is executed in parallel. An OpenACC compiler should have no trouble deciding that the j loop is not parallelizable.
Figure 15.15 Data dependence.
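A sketch of the two loop patterns just described is:

for (int i = 0; i < n; i++)
  a[i] = b[i] + c[i];      // no cross-iteration dependence: parallelizable

for (int j = 1; j < n; j++)
  a[j] = a[j-1] + b[j];    // reads a[j-1] written by the previous
                           // iteration: not parallelizable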
For the k loop, there is no data dependence if x[] and y[] are not aliased. However, this cannot be decided by examining the function foo() alone. If an OpenACC compiler does not perform interprocedural analysis, or the call site of function foo() is not available, then the compiler has to conservatively assume there is a data dependence. If x[] and y[] are indeed never aliased, we can add the C restrict qualifier to the declarations of the pointer arguments x and y, as illustrated in Figure 15.16. An OpenACC compiler should then be able to use this information to decide that the k loop is parallelizable.
Figure 15.16 Use the restrict qualifier to specify no aliasing.
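A sketch of the declaration, with the function body as a hypothetical stand-in, is:

void foo(float *restrict x, float *restrict y, int n) {
  for (int k = 0; k < n; k++)
    x[k] = y[k] + 1.0f;    // parallelizable now that x and y are
                           // declared never to alias
}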
Now what to do with the l loop? This loop is parallelizable if the value of n is no less than m. Let's assume this is always true in this program. However, there is no C language construct to express such information. In this case, we can add an independent clause to the loop construct, as illustrated in Figure 15.17. The independent clause simply tells the compiler that the associated loop is parallelizable and no analysis is required. You can also add the independent clause to the i loop construct. You could also add the clause to the j loop construct, but that would not be correct.
Figure 15.17 Use the ‘independent’ clause to declare a loop is parallelizable.
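A sketch of the annotation, assuming the l loop reads x[l + n] so that reads and writes touch disjoint parts of x whenever n is no less than m, is:

#pragma acc loop independent
for (int l = 0; l < m; l++)
  x[l] = x[l + n];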
So far you have seen the copy, copyin, and copyout clauses used on parallel and kernels constructs. These are called data clauses. A data clause has a list of arguments separated by commas. Each argument can be a variable name or a subarray specification. The OpenACC compiler and runtime will create a copy of the variable or subarray in the device memory. References to the variable or subarray within the parallel or kernels construct will be made to the device copy.
The code snippet in Figure 15.18 is from the matrix multiplication example in Figure 15.4. Here, three pieces of memory are allocated on the device. Arrays M and N are the input data, so they are declared as copyin. The copyin from the host memory to the device memory happens right before the parallel region starts execution. Array P is the output data, so it is declared as copyout. The copyout from the device memory to the host memory happens right after the parallel region ends. The copy clause can be used to declare data that needs to be both copied in and copied out.
Notice that subarray specifications are used for M, N, and P here. That is because M, N, and P are actually pointers, and we need to specify the range of memory that needs to be copied. The values before and after the : specify the starting array element and the number of array elements, respectively. So M[0:Mh*Mw] means M[0], M[1], M[2], …, and M[Mh*Mw-1]. A common programmer error is mistaking the second value for the index of the last array element.
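Putting these clauses together for the matrix multiplication, and assuming M is Mh×Mw, N is Mw×Nw, and P is Mh×Nw (only M's extent appears in the text; the other two are our assumption), the directive might read:

#pragma acc parallel loop copyin(M[0:Mh*Mw], N[0:Mw*Nw]) copyout(P[0:Mh*Nw])
for (int i = 0; i < Mh; i++) {
  /* compute row i of P from M and N */
}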
Some variables do not need to be copied in or copied out; their values are generated and consumed within a kernel. In such cases, the create clause can be used.
Another commonly used data clause is the deviceptr clause. This clause takes a list of pointers as its argument and declares that these are actually device pointers so that the data does not need to be allocated or moved between the host and the device for memory pointed by these pointers. When a program uses both OpenACC and CUDA kernels (or CUDA libraries, such as cuFFT, cuBLAS, etc.), the deviceptr clause becomes handy. Figure 15.19 shows an example of doing the matrix multiplication twice, first using a CUDA kernel and then using an OpenACC parallel region—both work on the same device memory allocated by cudaMalloc().
Figure 15.19 Use deviceptr to pass cudaMalloc() data to OpenACC parallel or kernels region.
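A sketch of this interoperation, with my_kernel, d_in, blocks, threads, and n as hypothetical names, is:

float *d_in;
cudaMalloc((void **)&d_in, n * sizeof(float));
my_kernel<<<blocks, threads>>>(d_in, n);   // CUDA kernel works on d_in

#pragma acc parallel loop deviceptr(d_in)  // OpenACC reuses the same memory
for (int i = 0; i < n; i++)
  d_in[i] = 2.0f * d_in[i];                // no extra allocation or transfer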
In OpenACC, host memory and device memory are separate. Data transfer between the host and the accelerator can play a significant role in the overall performance of an OpenACC application. For example, when a computationally intense loop nest of an iterative solver, implemented using a parallel loop, transfers data back and forth between the host and the accelerator at every iteration, there may be a significant loss of performance. The OpenACC data construct allows one to exploit reuse by avoiding data transfers across multiple executions of parallel or kernels regions.
Figure 15.20 shows a simplified implementation of a 2D Jacobi relaxation. Each element in the array field is updated with the average of the element and its eight neighbors. This is repeated 256 times. Another array, tmpfield, is used to make the relaxation parallel. In each pass, the values are read from one array, the average is computed, and the result is written into the same position in the second array. Since the two arrays do not overlap, the updates are completely data parallel. Lines 6-24 implement one pass of the relaxation. Each pass is executed in an OpenACC parallel region. We group the 256 passes into 128 pairs. Each pair contains two parallel regions: one updates tmpfield with field, and the other updates field with tmpfield. Recall that there is no synchronization between gangs. Therefore, we need two parallel constructs to make sure the writes to one array are completed before that array can be used as the source of updates in the next pass.
Figure 15.20 Use of data and update constructs.
We want the data to stay on the device during all 256 passes. This is achieved by using the data construct in line 28. The data region specified by the data construct spans lines 30-38, including all called functions. The copy(field) clause says we need to create a device copy of array field, copy its data from the host to the device when the data region starts, and copy its data back to the host when the data region ends. The enclosed parallel constructs at lines 31 and 33 simply use this copy of field. The create(tmpfield) clause says we need to create a device copy of array tmpfield for this data region, without any copying, and use this copy for the enclosed parallel constructs at lines 31 and 33.
Now the data stays on the device during all the passes. What if we want to occasionally check the intermediate result on the host? We can do so by using the update directive, as illustrated at line 36. This says that the value of the host array field should be updated with that of the device copy at this point. Since the update is performed conditionally in the code, the data transfer will not happen if it is not required. The update directive can also be used to update the value of the device copy with that of the host copy.
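A sketch of the overall structure just described, with smooth(), display(), size, and the check condition as hypothetical stand-ins for the code in Figure 15.20, is:

#pragma acc data copy(field[0:size]) create(tmpfield[0:size])
{
  for (int pass = 0; pass < 128; pass++) {
    smooth(tmpfield, field);   // parallel region: read field, write tmpfield
    smooth(field, tmpfield);   // parallel region: read tmpfield, write field
    if (check_this_pass) {
      #pragma acc update host(field[0:size])
      display(field);          // inspect the intermediate result on the host
    }
  }
}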
OpenACC provides support for asynchronous computation and data transfer. An async clause can be added to a parallel, kernels, or update directive to enable asynchronous execution. If there is no async clause, the host process will wait until the parallel region, kernels region, or update is complete before continuing. If there is an async clause, the host process will continue with the code following the directive while the parallel region, kernels region, or update is processed asynchronously. An asynchronous activity can be waited on by using the wait directive or the OpenACC runtime library routines.
In the Jacobi relaxation example in Figure 15.20, the update of the host copy of field (line 37) and the display of it on the host could happen in parallel with the compute of tmpfield (lines 31 and 32) on the device.
In Figure 15.21, to enable the asynchronous execution, we move the update and display in between the two parallel regions, add an async clause to the parallel directive at line 31, and add a wait directive before the second parallel directive at line 33.
Figure 15.21 async and wait.
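A sketch of this reorganization, with stencil(), display(), and size as hypothetical stand-ins, is:

#pragma acc parallel loop async          // compute tmpfield from field
for (int i = 0; i < size; i++)
  tmpfield[i] = stencil(field, i);

#pragma acc update host(field[0:size])   // overlaps with the async compute
display(field);

#pragma acc wait                         // join before the next region
#pragma acc parallel loop                // compute field from tmpfield
for (int i = 0; i < size; i++)
  field[i] = stencil(tmpfield, i);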
We can replace the wait directive with a call to the OpenACC acc_async_wait_all() routine and achieve the same effect. OpenACC provides a richer set of routines to support the asynchronous wait functionality, including the capability to test whether an asynchronous activity has completed rather than just waiting for its completion.
We believe OpenACC will become a promising and effective approach for porting existing applications to accelerators, and even for writing accelerated applications from scratch. The following are a few directions in which we see the OpenACC programming model going.
• Be more general. The current OpenACC model and implementations have quite a few limitations, such as the requirement that function calls be inlinable and the lack of support for dynamic memory allocation on the device. This is because most OpenACC features were originally designed in the CUDA 3.0 timeframe. Since then, more software and hardware features have been developed on the CUDA platform. For example, in CUDA 4.0, GPUs can be shared across multiple threads, and support for C++ new/delete and virtual functions was added. In CUDA 5.0, separate compilation and device code linking became available. OpenACC will certainly take advantage of these new technologies to make the programming model more general.
• Integrate with OpenMP. OpenMP and OpenACC both use the directive approach to parallel programming. OpenMP has traditionally focused on shared-memory systems. The OpenMP ARB has formed an accelerator working group to extend OpenMP support to accelerators. All OpenACC founding members are members of this working group, and they intend to merge the two specifications to create a common one.
Last but not least, we encourage you to follow the latest development of OpenACC by visiting the official OpenACC web site at openacc.org. Besides the latest update to the specification itself, the web site provides a rich resource for documents, FAQs, tutorials, code samples, vendor news, and discussion forums.
15.1. In the following parallel region, how many instances of statement 1 will be executed in total?
#pragma acc parallel num_gangs(1024) num_workers(32)
{
#pragma acc loop worker
for (int i=0; i<2048; i++) {
statement 1;
}
}
15.2. What are the two major differences between the parallel construct and the kernels construct?
15.3. Implement the matrix multiplication using the kernels construct.
15.4. Reimplement the Jacobi relaxation using the kernels construct. Use different numbers of gangs, workers, and vector lengths to see how they affect performance.
16.1 Background
16.2 Motivation
16.3 Basic Thrust Features
16.4 Generic Programming
16.5 Benefits of Abstraction
16.6 Programmer Productivity
16.7 Best Practices
16.8 Exercises
This chapter demonstrates how to leverage the Thrust parallel template library to implement high-performance applications with minimal programming effort. Based on the C++ Standard Template Library (STL), Thrust brings a familiar high-level interface to the realm of GPU computing while remaining fully interoperable with the rest of the CUDA software ecosystem. Thrust provides a set of type-generic parallel algorithms that can be used with user-defined data types. These parallel algorithms can significantly reduce the effort of developing parallel applications. Applications written with Thrust are concise, readable, and efficient.
C++ provides a way for programmers to define generics. In situations when a programming problem has the same solution for many different data types, the solution can be written once and for all using generics. For example, the two C++ functions shown in the following code sum a float array and an int array. They are defined without using type generics. The only difference between the first and second function is that float is changed to int.
float sum(int n, float *p) {
  float a = 0;
  for (int i = 0; i < n; i++) a += p[i];
  return a;
}

int sum(int n, int *p) {
  int a = 0;
  for (int i = 0; i < n; i++) a += p[i];
  return a;
}
Instead of writing a different version of sum for each data type, the following generic sum function can be used with any data type. The idea is that the programmer prepares a template of the sum function that can be instantiated on different types of array. The template keyword indicates the beginning of a type-generic definition. From this point on, we will use type-generic and generic interchangeably.
template<typename T>
T sum(int n, T *p) {
  T a = 0;
  for (int i = 0; i < n; i++) a += p[i];
  return a;
}
The code uses T as a placeholder where the actual type needs to be. Replacing T by float in the generic code yields one of the two definitions of sum, while replacing T by int yields the other. T could also be replaced by other types, including user-defined types. A C++ compiler will make the appropriate replacement each time the sum function is used. Consequently, sum behaves much like the preceding overloaded C++ function, and it can be used as if it were an overloaded function. The central concept of generic programming is the use of type parameters, like T in this example, that can be replaced by arbitrary types.
Thrust is a library of generic functions. By providing generic functions for each type of computation to be supported, Thrust does not need to have multiple versions of each function replicated for each eligible data type.
In fact, not all data types can be used with a generic function. Because sum uses addition and initializes a to 0, it requires the type T to behave (broadly speaking) like a number. Replacing T by the numeric types int or float produces a valid function definition, but replacing T by void or FILE* does not. Such requirements are called concepts, and when a type satisfies a requirement it is said to model a concept. In sum, whatever replaces T must model the "number" concept. That is, sum will compute a sum provided that it's given a pointer to some type T that acts like a numeric type. Otherwise, it may produce an error or return a meaningless result. Generic libraries like Thrust rely on concepts as part of their interface.
C++ classes can be generic as well. The idea is similar to generic functions, with the extra feature that a class’s fields can depend on type parameters. Generics are commonly used to define reusable container classes, such as those in the STL [HB2011]. Container classes are implementations of data structures, such as queues, linked lists, and hash tables, that can be used to hold arbitrary data types. For instance, a very simple generic array container class could be defined as follows:
template<typename T>
class Array {
  T contents[10];
public:
  T read(int i) {return contents[i];}
  void write(int i, T x) {contents[i] = x;}
};
Containers for different data types can be created using this generic class. Their types are written as the generic class name followed by a type in angle brackets: Array<int> for an array of int, Array<float *> for an array of float*, and so forth. The type given in angle brackets replaces the type parameter in the class definition.
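For example, one might instantiate and use the container as follows:

Array<int> ai;          // an array of 10 int elements
ai.write(0, 42);
int v = ai.read(0);     // v is 42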
While this is not a complete description of how generics work, it conveys the essential ideas for understanding the use of generics in this chapter.
We will introduce one more background concept: iterators. In the same way that pointers are used to access arrays, iterators are used to access container classes. The term iterator refers both to a C++ concept and to a value whose type is a model of this concept. An iterator represents a position within a container: it can be used to access the element at that position, to go to a neighboring position, or to be compared with other positions.
Pointers are a model of the iterator concept, and they can be used to loop over an array as shown in the following:
int a[50];
for (int *i = a; i < a + 50; i++) *i = 1;
Iterators can be used to loop over an STL vector in a very similar way:
vector<int> a(50);
for (vector<int>::iterator i = a.begin(); i < a.end(); i++) *i = 1;
The member functions begin() and end() return iterators referencing the beginning and just past the end of the vector. The ++, <, and * operators are overloaded to act like their pointer counterparts. Because many container classes provide an iterator interface, generic C++ code using iterators can be reused to process different kinds of containers.
CUDA C allows developers to make fine-grained decisions about how computations are decomposed into parallel threads and executed on the device. The level of control offered by CUDA C is an important feature: it facilitates the development of high-performance algorithms for a variety of computationally demanding tasks that (1) merit significant optimization, and (2) profit from low-level control of the mapping onto hardware. For this class of computational tasks CUDA C is an excellent solution.
Thrust [HB2011] solves a complementary set of problems, namely those that are (1) implemented efficiently without a detailed mapping of work onto the target architecture, or those that (2) do not merit or simply will not receive significant optimization effort by the user. With Thrust, developers describe their computation using a collection of high-level algorithms and completely delegate the decision of how to implement the computation to the library. This abstract interface allows programmers to describe what to compute without placing any additional restrictions on how to carry out the computation. By capturing the programmer’s intent at a high level, Thrust has the discretion to make informed decisions on behalf of the programmer and select the most efficient implementation.
The value of high-level libraries is broadly recognized in high-performance computing. For example, the widely used BLAS standard provides an abstract interface to common linear algebra operations. First conceived more than three decades ago, BLAS remains relevant today in large part because it allows valuable, platform-specific optimizations to be introduced behind a uniform interface.
Whereas BLAS is focused on numerical linear algebra, Thrust provides an abstract interface to fundamental parallel algorithms such as scan, sort, and reduction. Thrust leverages the power of C++ templates to make these algorithms generic, enabling them to be used with arbitrary user-defined types and operators. Thrust establishes a durable interface for parallel computing with an eye toward generality, programmer productivity, and real-world performance.
Before going into greater detail, let us consider the program in Figure 16.1, which illustrates the salient features of Thrust.
Figure 16.1 A complete Thrust program that sorts data on a GPU.
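Figure 16.1 follows the canonical Thrust example; a sketch of such a program is:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main(void) {
  // generate random data on the host
  thrust::host_vector<int> h_vec(1 << 24);
  thrust::generate(h_vec.begin(), h_vec.end(), rand);

  // transfer to the device and sort there
  thrust::device_vector<int> d_vec = h_vec;
  thrust::sort(d_vec.begin(), d_vec.end());

  // transfer the sorted result back to the host
  thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
  return 0;
}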
Thrust provides two vector containers: host_vector and device_vector. As the names suggest, host_vector is stored in the host memory while device_vector lives in the device memory on a GPU. Like the vector container in the C++ STL, host_vector and device_vector are generic containers (i.e., they are able to store any data type) that can be resized dynamically. As the example shows, containers automate the allocation and de-allocation of memory and simplify the process of exchanging data between the host and the device.
The program acts on the vector containers using the generate, sort, and copy algorithms. Here, we adopt the STL convention of specifying ranges using pairs of iterators. In this example, the iterators h_vec.begin() and h_vec.end() point to the first element and the element one past the end of the array, respectively. Together the pair defines a range of integers of size h_vec.end() – h_vec.begin().
Note that even though the computation implied by the call to the sort algorithm suggests one or more CUDA kernel launches, the programmer has not specified a launch configuration. Thrust's interface abstracts these details. The choice of performance-sensitive parameters such as grid and block size, the details of memory management, and even the choice of sorting algorithm are left to the discretion of the library implementer.
Although vector iterators are similar to pointers, they carry additional information. Notice that we did not have to instruct the sort algorithm that it was operating on the elements of a device_vector or hint that the copy was from the device memory to the host memory. In Thrust the memory spaces of each range are automatically inferred from the iterator arguments and used to dispatch the appropriate implementation.
In addition to memory space, Thrust’s iterators implicitly encode a wealth of information that can guide the dispatch process. For instance, our sort example in Figure 16.1 operates on int, a primitive data type with a fundamental comparison operation. In this case, Thrust dispatches a highly tuned radix sort algorithm [MG2010] that is considerably faster than alternative comparison-based sorting algorithms such as merge sort [SHG2009]. It is important to realize that this dispatch process incurs no performance or storage overhead: metadata encoded by iterators exists only at compile time, and dispatch strategies based on it are selected statically. In general, Thrust’s static dispatch strategies may capitalize on any information that is derivable from the type of an iterator.
Thrust is implemented entirely within CUDA C/C++ and maintains interoperability with the rest of the CUDA ecosystem. Interoperability is an important feature because no single language or library is the best tool for every problem. For example, although Thrust algorithms use CUDA features like shared memory internally, there is no mechanism for users to exploit shared memory directly through Thrust. Therefore, it is sometimes necessary for applications to access CUDA C directly to implement a certain class of specialized algorithms, as illustrated in the software stack of Figure 16.2. Interoperability between Thrust and CUDA C allows the programmer to replace a Thrust kernel with a CUDA kernel and vice versa by making a small number of changes to the surrounding code.
Figure 16.2 Thrust is an abstraction layer on top of CUDA C/C++.
Interfacing Thrust to CUDA C is straightforward and analogous to the use of the C++ STL with standard C code. Data that resides in a Thrust container can be accessed by external libraries by extracting a "raw" pointer from the vector. The code sample in Figure 16.3 illustrates the use of a raw pointer cast to obtain an int pointer to the contents of a device vector.
Figure 16.3 Thrust interoperates smoothly with CUDA C/C++: (a) interfacing Thrust to CUDA, and (b) interfacing CUDA to Thrust.
In Figure 16.3(a), the function raw_pointer_cast() takes the address of element 0 of a device vector d_vec and returns a raw C pointer raw_ptr. This pointer can then be used to call CUDA C API functions (cudaMemset() in this example) or passed as a parameter to a CUDA C kernel (my_kernel in this example).
Applying Thrust algorithms to raw C pointers is also straightforward. Once the raw pointer has been wrapped by a device_ptr it can be used like an ordinary Thrust iterator. In Figure 16.3(b), the C pointer raw_ptr points to a piece of device memory allocated by cudaMalloc(). It can be converted or wrapped into a device pointer to a device vector by the device_pointer_cast() function. The wrapped pointer provides the memory space information Thrust needs to invoke the appropriate algorithm implementation and also allows a convenient mechanism for accessing device memory from the host. In this case, the information indicates that dev_ptr points to a vector in the device memory and the elements are of type int.
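Both directions can be sketched together as follows (my_kernel is a hypothetical CUDA kernel; the casts and fill are standard Thrust calls):

#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/fill.h>
#include <cuda_runtime.h>

void interop_example(void) {
  // (a) Thrust to CUDA C: extract a raw pointer from a device_vector
  thrust::device_vector<int> d_vec(1024);
  int *raw_ptr = thrust::raw_pointer_cast(&d_vec[0]);
  cudaMemset(raw_ptr, 0, d_vec.size() * sizeof(int));
  // my_kernel<<<4, 256>>>(raw_ptr);

  // (b) CUDA C to Thrust: wrap a cudaMalloc'd pointer
  int *raw2;
  cudaMalloc((void **)&raw2, 1024 * sizeof(int));
  thrust::device_ptr<int> dev_ptr = thrust::device_pointer_cast(raw2);
  thrust::fill(dev_ptr, dev_ptr + 1024, 13);
  cudaFree(raw2);
}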
Thrust’s native CUDA C interoperability ensures that Thrust always complements CUDA C and that a Thrust plus CUDA C combination is never worse than either Thrust or CUDA C alone. Indeed, while it may be possible to write whole parallel applications entirely with Thrust functions, it is often valuable to implement domain-specific functionality directly in CUDA C. The level of abstraction targeted by native CUDA C affords programmers fine-grained control over the precise mapping of computational resources to a particular problem. Programming at this level provides developers the flexibility to implement exotic or otherwise specialized algorithms. Interoperability also facilitates an iterative development strategy: (1) quickly prototype a parallel application entirely in Thrust, (2) identify the application’s hot spots, and (3) write more specialized algorithms in CUDA C and optimize as necessary.
Thrust presents a style of programming emphasizing code reusability and composability. Indeed, the vast majority of Thrust’s functionality is derived from four fundamental parallel algorithms: for_each, reduce, scan, and sort. For example, the transform algorithm is a derivative of for_each while the inner product is implemented with reduce.
Thrust algorithms are generic in both the type of the data to be processed and the operations to be applied to the data. For instance, the reduce algorithm may be employed to compute the sum of a range of integers (a plus reduction applied to int data) or the maximum of a range of floating-point values (a max reduction applied to float data). This generality is implemented via C++ templates, which allows user-defined types and functions to be used in addition to built-in types such as int or float, or Thrust operators such as plus.
Generic algorithms are extremely valuable because it is impractical to anticipate precisely which particular types and operators a user will require. Indeed, while the computational structure of an algorithm is fixed, the number of instantiations of the algorithm is limitless. However, it is also worth mentioning that while Thrust’s interface is general, the abstraction affords implementors the opportunity to specialize for specific types and operations known to be important use cases. These opportunities may be exploited statically.
在 Thrust 中,用户定义的操作采用 C++ 函数对象或函子的形式。函子允许程序员采用通用算法来执行特定的用户定义操作。例如,图 16.4中的代码示例分别使用 CUDA C 和 Thrust 实现 SAXPY(众所周知的 BLAS 操作)。 CUDA C代码大家应该很熟悉,提供来进行比较。
In Thrust, user-defined operations take the form of C++ function objects, or functors. Functors allow the programmer to adapt a generic algorithm to perform a specific user-defined operation. For example, the code samples in Figure 16.4 implement SAXPY, the well-known BLAS operation, using CUDA C and Thrust, respectively. The CUDA C code should be very familiar and is provided for comparison.
图 16.4 (a) CUDA C 和 (b) Thrust 中的 SAXPY 实现。
Figure 16.4 SAXPY implementations in (a) CUDA C and (b) Thrust.
信任代码有两部分。在第一部分中,代码设置了一个 SAXPY 函子,该函子接收输入浮点值a并将其维护为一个状态。然后可以将其称为对两个输入值x和y执行a*x +y的运算符。最后,使用用户定义的saxpy_functor func调用通用变换算法。提供给的迭代器变换算法会将func应用于每对x和y元素并生成 SAXPY 结果。请注意, saxpy_functor声明中定义的运算符可以重载,以便可以将不同类型的a、x、y传递到变换算法中,并且将调用正确的运算符来为每种类型的输入生成预期的输出值。这使得创建通用 SAXPY 函数成为可能。
The Trust code has two parts. In the first part, the code sets up a SAXPY functor that receives an input floating value a and maintains it as a state. It can then be called as an operator that performs a∗x +y on two input values x and y. Finally, the generic transform algorithm is called with the user-defined saxpy_functor func. The iterators provided to the transform algorithm will apply func to each pair of the x and y elements and produce the SAXPY results. Note that the operator defined in the saxpy_functor declaration can be overloaded so that different types of a, x, y can be passed into the transform algorithm and the correct operator will be invoked to generate the expected output values for each type of inputs. This makes it possible to create a generic SAXPY function.
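For readers without the figure at hand, a sketch in the spirit of Figure 16.4(b) follows (details may differ from the printed listing):

#include <thrust/device_vector.h>
#include <thrust/transform.h>

struct saxpy_functor {
  const float a;                       // state carried by the functor
  saxpy_functor(float a_) : a(a_) {}
  __host__ __device__
  float operator()(const float& x, const float& y) const {
    return a * x + y;                  // the SAXPY operation itself
  }
};

void saxpy(float a, thrust::device_vector<float>& x,
           thrust::device_vector<float>& y) {
  // y <- a*x + y, applied element by element on the device
  thrust::transform(x.begin(), x.end(), y.begin(), y.begin(),
                    saxpy_functor(a));
}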
C++ Function Objects

A C library developer can set up a generic function by allowing the user to provide a callback function. For example, a sort function can allow the user to pass a function pointer as a parameter to perform the comparison operation for determining the order between two input values. This allows the user to pass any type of input as long as he or she can define a comparison function between two input values.
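The C standard library’s qsort() is a familiar instance of this pattern; a comparison callback for int data might look like this:

#include <cstdlib>

// callback: a negative/zero/positive result orders the two elements
int compare_ints(const void* a, const void* b) {
  int x = *static_cast<const int*>(a);
  int y = *static_cast<const int*>(b);
  return (x > y) - (x < y);
}

int main() {
  int data[] = {3, 1, 2};
  std::qsort(data, 3, sizeof(int), compare_ints);
  return data[0];  // 1 after sorting
}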
It is sometimes desirable for a callback function to maintain state. The C++ function object, or functor, provides a convenient way to do so. A functor is really a function defined on an object that holds a state. The function that is passed as the callback is just a member function defined in the class declaration of the object. In the case of the saxpy_functor class, a is the class data and operator() is the member function defined on the data. When an instance of saxpy_functor, func, is passed to a generic algorithm function such as transform(), the operator will be called to operate on each pair of x and y elements.
In this section we’ll describe the benefits of Thrust’s abstraction layer with respect to programmer productivity, robustness, and real-world performance.

Thrust’s high-level algorithms enhance programmer productivity by automating the mapping of computational tasks onto the GPU. Recall the two implementations of SAXPY shown in Figure 16.4. In the CUDA C implementation of SAXPY the programmer has described a specific decomposition of the parallel vector operation into a grid of blocks with 256 threads per block. In contrast, the Thrust implementation does not prescribe a launch configuration. Instead, the only specifications are the input and output ranges and a functor to apply to them. Otherwise, the two codes are roughly the same in terms of length and code complexity.
Delegating the launch configuration to Thrust has a subtle yet profound implication: the launch parameters can be automatically chosen based on a model of machine performance. Currently, Thrust targets maximal occupancy and will compare the resource usage of the kernel (e.g., number of registers, amount of shared memory) with the resources of the target GPU to determine a launch configuration with the highest occupancy. While the maximal occupancy heuristic is not necessarily optimal, it is straightforward to compute and effective in practice. Furthermore, there is nothing to preclude the use of more sophisticated performance models. For instance, a runtime tuning system that examined hardware performance counters could be introduced behind this abstraction without altering client code.

Thrust also boosts programmer productivity by providing a rich set of algorithms for common patterns. For instance, the map-reduce pattern is conveniently implemented with Thrust’s sort_by_key and reduce_by_key algorithms, which implement key-value sorting and reduction, respectively.
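A minimal sketch of the reduction half of that pattern (keys are assumed to be presorted so that equal keys are adjacent; the names are illustrative):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main() {
  // keys must be sorted so that equal keys are adjacent
  int   k[] = {0, 0, 1, 1, 1};
  float v[] = {1.f, 2.f, 3.f, 4.f, 5.f};
  thrust::device_vector<int>   keys(k, k + 5);
  thrust::device_vector<float> vals(v, v + 5);
  thrust::device_vector<int>   out_keys(2);
  thrust::device_vector<float> out_vals(2);

  // reduces runs of equal keys: out_keys = {0,1}, out_vals = {3,12}
  thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                        out_keys.begin(), out_vals.begin());
  return 0;
}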
Thrust’s abstraction layer also enhances the robustness of CUDA applications. In the previous section we noted that by delegating the launch configuration details to Thrust we could automatically obtain maximum occupancy during execution. In addition to maximizing occupancy, the abstraction layer also ensures that algorithms “just work,” even in uncommon or pathological use cases. For instance, Thrust automatically handles limits on grid dimensions (no more than 64 K in current devices), works around limitations on the size of global function arguments, and accommodates large user-defined types in most algorithms. To the degree possible, Thrust circumvents such factors and ensures correct program execution across the full spectrum of CUDA-capable devices.

In addition to enhancing programmer productivity and improving robustness, the high-level abstractions provided by Thrust improve performance in real-world use cases. In this section we examine two instances where the discretion afforded by Thrust’s high-level interface is exploited for meaningful performance gains.
To begin, consider the operation of filling an array with a particular value. In Thrust, this is implemented with the fill algorithm. Unfortunately, a straightforward implementation of this seemingly simple operation is subject to severe performance hazards. Recall that processors based on the G80 architecture (i.e., compute capability 1.0 and 1.1) impose strict conditions on which memory access patterns may benefit from memory coalescing [NVIDIA2010]. In particular, memory accesses of subword granularity (i.e., less than 4 bytes) are not coalesced by these processors. This artifact is detrimental to performance when initializing arrays of char or short types.

Fortunately, the iterators passed to fill implicitly encode all the information necessary to intercept this case and substitute an optimized implementation. Specifically, when fill is dispatched for smaller types, Thrust selects a “wide” version of the algorithm that issues word-sized accesses per thread. While this optimization is straightforward to implement, users are unlikely to invest the effort of making this optimization themselves. Nevertheless, the benefit, shown in Table 16.1, is worthwhile, particularly on earlier architectures. Note that with the relaxed coalescing rules on the more recent processors, the benefit of the optimization has somewhat decreased but is still significant.

Table 16.1 Memory Bandwidth of Two fill Kernels
Like fill, Thrust’s sorting functionality exploits the discretion afforded by the abstract sort and stable_sort functions. As long as the algorithm achieves the promised result, we are free to utilize sophisticated static (compile-time) and dynamic (runtime) optimizations to implement the sorting operation in the most efficient manner.

As mentioned in Section 16.3, Thrust statically selects a highly optimized radix sort algorithm [MG2010] for sorting primitive types (e.g., char, int, float, and double) with the standard less and greater comparison operators. For all other types (e.g., user-defined data types) and comparison operators, Thrust uses a general merge sort algorithm. Because sorting primitives with radix sort is considerably faster than merge sort, this static optimization has significant value.
Thrust also applies dynamic optimizations to improve sorting performance. Since the cost of radix sort is proportional to the number of significant key bits, we can exploit unused key bits to reduce the cost of sorting. For instance, when all integer keys are in the range [0, 16), only 4 bits must be sorted, and we observe a 2.71× speedup versus a full 32-bit sort. The relationship between key bits and radix sort performance is plotted in Figure 16.5.

Figure 16.5 Sorting integers on the GeForce GTX480: Thrust’s dynamic sorting optimizations improve performance by a considerable margin in common use cases where keys are less than 32 bits.

In this section we highlight three high-level optimization techniques that programmers may employ to yield significant performance speedups when using Thrust.
The balance of computational resources on modern GPUs implies that algorithms are often bandwidth limited. Specifically, computations with a low CGMA (Computation to Global Memory Access) ratio, the ratio of calculations per memory access, are constrained by the available memory bandwidth and do not fully utilize the computational resources of the GPU. One technique for increasing the computational intensity of an algorithm is to fuse multiple pipeline stages together into a single operation. In this section we demonstrate how Thrust enables developers to exploit opportunities for kernel fusion and better utilize GPU memory bandwidth.

The simplest form of kernel fusion is scalar function composition. For example, suppose we have the functions f(x) → y and g(y) → z and would like to compute g(f(x)) → z for a range of scalar values. The most straightforward approach is to read x from memory, compute y = f(x), write y back to memory, and then do the same to compute z = g(y). In Thrust this approach would be implemented with two separate calls to the transform algorithm, one for f and one for g. While this approach is straightforward to understand and implement, it needlessly wastes memory bandwidth, which is a scarce resource.

A better approach is to fuse the functions into a single operation g(f(x)) and halve the number of memory transactions. Unless f and g are computationally expensive operations, the fused implementation will run approximately twice as fast as the first approach. In general, scalar function composition is a profitable optimization and should be applied liberally.
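A sketch of such fusion, with f and g chosen arbitrarily for illustration:

#include <thrust/device_vector.h>
#include <thrust/transform.h>

struct fused_fg {
  __host__ __device__
  float operator()(float x) const {
    float y = 2.0f * x + 1.0f;   // f(x), illustrative
    return y * y;                // g(y), illustrative
  }
};

int main() {
  thrust::device_vector<float> x(1 << 20, 3.0f);
  thrust::device_vector<float> z(x.size());
  // one kernel, one read of x and one write of z, instead of two
  // transform calls with an intermediate y stored in memory
  thrust::transform(x.begin(), x.end(), z.begin(), fused_fg());
  return 0;
}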
Thrust enables developers to exploit other, less obvious opportunities for fusion. For example, consider the two Thrust implementations of the BLAS function snrm2 shown in Figure 16.6, which computes the Euclidean norm of a float vector.

Figure 16.6 snrm2 has low arithmetic intensity and therefore benefits greatly from fusion.
Note that snrm2 has low arithmetic intensity: each element of the vector participates in only two floating-point operations—one multiply (to square the value) and one addition (to sum values together). Therefore, an implementation of snrm2 using the transform_reduce algorithm, which fuses the square transformation with a plus reduction, should be considerably faster. Indeed, this is true and snrm2_fast is fully 3.8 times faster than snrm2_slow for a 16 M element vector on a Tesla C1060.
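A sketch of the fused version in the spirit of Figure 16.6 (the figure itself is not reproduced here; the helper name snrm2_fast follows the text):

#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <cmath>

struct square {
  __host__ __device__
  float operator()(float x) const { return x * x; }
};

float snrm2_fast(const thrust::device_vector<float>& v) {
  // fuses the squaring transform with the plus reduction: one pass over v
  float sum_sq = thrust::transform_reduce(v.begin(), v.end(),
                                          square(), 0.0f,
                                          thrust::plus<float>());
  return std::sqrt(sum_sq);
}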
While the previous examples represent some of the more common opportunities for fusion, we have only scratched the surface. As we have seen, fusing a transformation with other algorithms is a worthwhile optimization. However, Thrust would become unwieldy if all algorithms came with a transform variant. For this reason Thrust provides transform_iterator, which allows transformations to be fused with any algorithm. Indeed, transform_reduce is simply a convenience wrapper for the appropriate combination of transform_iterator and reduce. Similarly, Thrust provides permutation_iterator, which enables gather and scatter operations to be fused with other algorithms.
In the previous section we examined how fusion minimizes the number of off-chip memory transactions and conserves bandwidth. Another way to improve memory efficiency is to ensure that all memory accesses benefit from coalescing, since coalesced memory access patterns are considerably faster than noncoalesced transactions.

Perhaps the most common violation of the memory coalescing rules arises when using a so-called array of structures (AoS) data layout. Generally speaking, access to the elements of an array filled with C struct or C++ class variables will be uncoalesced. Only explicitly aligned structures such as the uint2 or float4 vector types satisfy the memory coalescing rules.

An alternative to the AoS layout is the structure of arrays (SoA) approach, where the components of each struct are stored in separate arrays. Figure 16.7 illustrates the AoS and SoA methods of representing a range of 3D float vectors. The advantage of the SoA method is that regular access to the x, y, and z components of a given vector is coalesceable (because float satisfies the coalescing rules), while regular access to the float3 structures in the AoS approach is not.
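In code, the two layouts of Figure 16.7 correspond roughly to the following declarations (a sketch; the type names are illustrative):

#include <cstdio>

// (a) Array of structures: the x, y, z of one vector are adjacent in memory,
// so threads reading all x components perform strided, uncoalesced accesses.
struct float3_aos { float x, y, z; };

// (b) Structure of arrays: each component lives in its own contiguous array,
// so consecutive threads reading x[i] touch consecutive words (coalesced).
struct vec3_soa { float *x, *y, *z; };

int main() {
  printf("AoS element stride: %zu bytes\n", sizeof(float3_aos)); // 12
  return 0;
}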
Figure 16.7 Data layouts for 3D float vectors: (a) AoS and (b) SoA.
The problem with SoA is that there is nothing to logically encapsulate the members of each element into a single entity. Whereas we could immediately apply Thrust algorithms to AoS containers like device_vector<float3>, we have no direct means of doing the same with three separate device_vector<float> containers. Fortunately, Thrust provides zip_iterator, which provides encapsulation of SoA ranges.

The zip_iterator [BIL] takes a number of iterators and zips them together into a virtual range of tuples. For instance, binding three device_vector<float> iterators together yields a range of type tuple<float,float,float>, which is analogous to the float3 structure.
Consider the code sample in Figure 16.8 that uses zip_iterator to construct a range of 3D float vectors stored in SoA format. Each vector is transformed by a rotation matrix in the rotate_tuple functor before being written out again. Note that zip_iterator is used for both the input and output ranges, transparently packing the underlying scalar ranges into tuples and then unpacking the tuples into the scalar ranges. On a Tesla C1060, the SoA implementation is 2.85× faster than the analogous AoS implementation (not shown).
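A sketch of the zip_iterator pattern from Figure 16.8, with the rotation simplified to a uniform scaling to keep the example short (the rotate_tuple functor in the figure is analogous):

#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/transform.h>
#include <thrust/tuple.h>

struct scale_tuple {
  __host__ __device__
  thrust::tuple<float,float,float>
  operator()(const thrust::tuple<float,float,float>& v) const {
    return thrust::make_tuple(2.0f * thrust::get<0>(v),
                              2.0f * thrust::get<1>(v),
                              2.0f * thrust::get<2>(v));
  }
};

int main() {
  const int N = 1024;
  thrust::device_vector<float> x(N, 1.0f), y(N, 2.0f), z(N, 3.0f); // SoA storage

  // zip the three scalar ranges into one virtual range of (x,y,z) tuples
  auto first = thrust::make_zip_iterator(thrust::make_tuple(x.begin(), y.begin(), z.begin()));
  auto last  = thrust::make_zip_iterator(thrust::make_tuple(x.end(),   y.end(),   z.end()));

  thrust::transform(first, last, first, scale_tuple());
  return 0;
}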
Figure 16.8 The make_zip_iterator function facilitates processing of data in structure of arrays format.
In the previous sections we considered ways to efficiently transform ranges of values and ways to construct ad hoc tuples of values from separate ranges. In either case, there was some underlying data stored explicitly in memory. In this section we illustrate the use of implicit ranges, that is, ranges whose values are defined programmatically and not stored anywhere in memory.

For instance, consider the problem of finding the index of the element with the smallest value in a given range. We could implement a special reduction kernel for this algorithm, which we’ll call min_index, but that would be time consuming and unnecessary. A better approach is to implement min_index in terms of existing functionality, such as a specialized reduction over (value, index) tuples, to achieve the desired result. Specifically, we can zip the range of values v[0], v[1], v[2], … together with a range of integer indices 0, 1, 2, … to form a range of tuples (v[0], 0), (v[1], 1), (v[2], 2), … and then implement min_index with the standard reduce algorithm. Unfortunately, this scheme will be much slower than a customized reduction kernel, since the index range must be created and stored explicitly in memory.
To resolve this issue Thrust provides counting_iterator [BIL], which acts just like the explicit range of values we need to implement min_index, but does not carry any overhead. Specifically, when counting_iterator is dereferenced it generates the appropriate value on the fly and yields that value to the caller. An efficient implementation of min_index using counting_iterator is shown in Figure 16.9.
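A sketch along the lines of Figure 16.9 (the functor and tuple types are illustrative; as noted below, Thrust’s min_element makes a hand-written min_index unnecessary in practice):

#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/reduce.h>
#include <thrust/tuple.h>

typedef thrust::tuple<float,int> ValIdx;   // (value, index) pair

struct smaller_value {
  __host__ __device__
  ValIdx operator()(const ValIdx& a, const ValIdx& b) const {
    return thrust::get<0>(a) < thrust::get<0>(b) ? a : b;
  }
};

int min_index(const thrust::device_vector<float>& v) {
  // counting_iterator generates indices 0,1,2,... on the fly: no storage
  auto first = thrust::make_zip_iterator(
      thrust::make_tuple(v.begin(), thrust::counting_iterator<int>(0)));
  float v0 = v[0];                       // seed the reduction with element 0
  ValIdx best = thrust::reduce(first, first + v.size(),
                               thrust::make_tuple(v0, 0), smaller_value());
  return thrust::get<1>(best);
}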
Figure 16.9 Implicit ranges improve performance by conserving memory bandwidth.
Here counting_iterator has allowed us to efficiently implement a special-purpose reduction algorithm without the need to write a new, special-purpose kernel. In addition to counting_iterator, Thrust provides constant_iterator, which defines an implicit range of a constant value. Note that these implicitly defined iterators can be combined with the other iterators to create more complex implicit ranges. For instance, counting_iterator can be used in combination with transform_iterator to produce a range of indices with a nonunit stride.
Read Figure 16.9 and explain the operation of the algorithm using a small example. In practice, there is no need to implement min_index since Thrust’s min_element algorithm provides the equivalent functionality. Nevertheless, the min_index example is instructive of best practices. Indeed, Thrust algorithms such as min_element, max_element, and find_if apply the exact same strategy internally.
1. Boost Iterator Library. Available at: <www.boost.org/doc/libs/release/libs/iterator/>.

2. Hoberock, J., & Bell, N. Thrust: A Parallel Template Library, 2011 [version 1.4.0].

3. Merrill, D., & Grimshaw, A. Revisiting sorting for GPGPU stream architectures. Technical Report CS2010-03. Department of Computer Science, University of Virginia, Charlottesville, 2010.

4. NVIDIA Corporation. CUDA C Best Practices Guide v3.2. Santa Clara, CA: NVIDIA Corporation, 2010 (Section 3.2.1).

5. Satish, N., Harris, M., & Garland, M. Designing efficient sorting algorithms for many-core GPUs. Proceedings of the Twenty-Third IEEE International Parallel and Distributed Processing Symposium, IEEE Computer Society, Washington, DC, 2009.
17.1 CUDA FORTRAN and CUDA C Differences
17.2 A First CUDA FORTRAN Program
17.3 Multidimensional Array in CUDA FORTRAN
17.4 Overloading Host/Device Routines With Generic Interfaces
17.5 Calling CUDA C Via Iso_C_Binding
17.6 Kernel Loop Directives and Reduction Operations
17.7 Dynamic Shared Memory
17.8 Asynchronous Data Transfers
17.9 Compilation and Profiling
17.10 Calling Thrust from CUDA FORTRAN
17.11 Exercises
This chapter gives an introduction to CUDA FORTRAN, the FORTRAN interface to the CUDA architecture. CUDA FORTRAN was developed in 2009 as a joint effort between the Portland Group (PGI) and NVIDIA. CUDA FORTRAN shares much in common with CUDA C, as it is based on the runtime API; however, there are some differences in how CUDA concepts are expressed using FORTRAN 90 constructs. The first section of this chapter discusses some of the basic differences between CUDA FORTRAN and CUDA C at a high level, and subsequent sections use various examples to illustrate CUDA FORTRAN programming.
CUDA FORTRAN and CUDA C have much in common, as CUDA FORTRAN is based on the CUDA C runtime API. Just as CUDA C is C with a few language extensions, CUDA FORTRAN is FORTRAN with a similar set of language extensions. Before we jump into CUDA FORTRAN code, it is helpful to summarize some of the differences between these two programming interfaces to the CUDA architecture.
FORTRAN is a strongly typed language, and this strong typing carries over into the CUDA FORTRAN implementation. Device data declared in CUDA FORTRAN host code is declared with the device variable attribute, unlike CUDA C, where both host and device data are declared the same way. Differentiating host and device data when variables are declared can simplify several aspects of dealing with device data. Allocation of device data can occur where the variable is declared; for example,
real, device :: a_d(N)
will allocate a_d to contain N elements on device 0. Device data can also be declared as allocatable and allocated using FORTRAN 90's allocate statement:
real, device, allocatable :: a_d(:)
…
allocate(a_d(N))
where the FORTRAN allocate routine has been overloaded to allocate arrays on the current device in the same way cudaMalloc does in CUDA C. CUDA FORTRAN's strong typing also affects how data transfers between the host and the device can be performed. While one can use the cudaMemcpy function to perform host-to-device and device-to-host blocking transfers, it is far easier to use assignment statements:
real :: a(N)
real, device :: a_d(N)
…
a_d = a
where the FORTRAN array assignment kicks off a cudaMemcpy behind the scenes. Transfer via assignment statements applies only to blocking or synchronous transfers; for asynchronous transfers one must use the cudaMemcpyAsync call.
CUDA FORTRAN makes use of other variable attributes besides the device attribute. The attributes shared, constant, pinned, and value also find frequent use in CUDA FORTRAN. Shared memory used in device code uses the shared variable attribute just as CUDA C uses the __shared__ qualifier. Constant memory must be declared in a FORTRAN module that contains the device code where it is used, and the module must be used in the host code where it is initialized. The initialization of constant data in the host code is done via an assignment statement rather than by function calls. Pinned host memory is declared using the pinned variable attribute, and must also be declared allocatable. Since FORTRAN passes data by reference by default and in CUDA we typically deal with separate memory spaces for the host and the device, host parameters passed to a kernel via the argument list must be declared in the kernel with the value variable attribute.
CUDA FORTRAN also uses the attributes(global) and attributes(device) function attributes in the same way CUDA C uses declaration specifiers __global__ and __device__ to declare kernels and device functions.
Within CUDA FORTRAN device code the predefined variables gridDim, blockDim, blockIdx, and threadIdx are available as they are in CUDA C. Following typical FORTRAN convention, the components of blockIdx and threadIdx have a unit, rather than 0, offset, so a typical index calculation would look like the following:
i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
This is in contrast to CUDA C's:
i = blockDim.x*blockIdx.x + threadIdx.x;
This rounds out the major differences in the expression of CUDA concepts between CUDA C and CUDA FORTRAN. The CUDA FORTRAN notation will become clearer as we go through several examples in the following sections.
The SAXPY routine has been used several times to illustrate various aspects of CUDA programming, and we continue this tradition with our first CUDA FORTRAN example:
module mathOps
contains
  attributes(global) subroutine saxpy(x, y, a)
    real :: x(:), y(:)
    real, value :: a
    integer :: i, n
    n = size(x)
    i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
    if (i <= n) y(i) = y(i) + a*x(i)
  end subroutine saxpy
end module mathOps
program testSaxpy
  use cudafor
  use mathOps
  implicit none
  integer, parameter :: N = 40000
  real :: x(N), y(N), a
  real, device :: x_d(N), y_d(N)
  type(dim3) :: grid, tBlock

  tBlock = dim3(256,1,1)
  grid = dim3(ceiling(real(N)/tBlock%x),1,1)
  x = 1.0; y = 2.0; a = 2.0
  x_d = x
  y_d = y
  call saxpy<<<grid,tBlock>>>(x_d, y_d, a)
  y = y_d
  write(*,*) 'Max error: ', maxval(abs(y-4.0))
end program testSaxpy
In this complete code the SAXPY kernel is defined in the FORTRAN module mathOps using the attributes(global) qualifier. The kernel has three arguments: the 1D arrays x and y, and the scalar value a. The size of the x and y arrays does not need to be passed as a kernel argument since x and y are declared as assumed-shape arrays, allowing the FORTRAN size() intrinsic to be used. Because a is defined on the host and must be passed by value, the value variable attribute is required in a's declaration in the kernel. The predefined blockDim, blockIdx, and threadIdx variables are used to calculate a global index i used to access elements of x and y. Once again note that blockIdx and threadIdx have a unit offset as opposed to CUDA C's zero offset. After checking for in-bounds access, the SAXPY operation is performed.
The host code uses the cudafor module, which defines CUDA runtime API routines, constants, and types, such as the type(dim3) used to declare the execution configuration variables grid and tBlock. In the host code, both host arrays x and y are declared as well as their device counterparts, x_d and y_d, where the latter are declared with the device variable attribute. The thread block and grid are defined in the first executable lines of host code, where the ceiling function is used to launch enough blocks to process all array elements in the case that the size of the array is not evenly divisible by the number of threads in a thread block. After the host arrays x and y, as well as the parameter a, are initialized, the assignment statements x_d=x and y_d=y are used to transfer the data from the host to the device. The scalar a is not passed to the device in this manner, as it is passed by value as a kernel argument. Since the transfers by assignment statement are blocking transfers, we can call the SAXPY kernel after the transfers without any synchronization. The kernel invocation specifies the execution configuration in the triple chevrons placed between the kernel name and its argument list as is done in CUDA C. Also similar to CUDA C, integer expressions can be used between the triple chevrons in place of the type(dim3) variables. This is followed by a device-to-host transfer of the resultant array, which is then checked for correctness.

Multidimensional arrays are first-class citizens in FORTRAN, and the ease of dealing with multidimensional data in FORTRAN is extended to CUDA FORTRAN. We have already seen one aspect of this in array assignments used for transfers between the host and the device. The ease of programming kernel code is evident from the following CUDA FORTRAN implementation of matrix multiply:
module mathOps
  integer, parameter :: TILE_WIDTH = 16
contains
  attributes(global) subroutine matrixMul(Md, Nd, Pd)
    implicit none
    real, intent(in) :: Md(:,:), Nd(:,:)
    real, intent(out) :: Pd(:,:)
    real, shared :: Mds(TILE_WIDTH, TILE_WIDTH)
    real, shared :: Nds(TILE_WIDTH, TILE_WIDTH)
    integer :: i, j, k, m, tx, ty, width
    real :: Pvalue

    tx = threadIdx%x; ty = threadIdx%y
    i = (blockIdx%x-1)*TILE_WIDTH + tx
    j = (blockIdx%y-1)*TILE_WIDTH + ty
    width = size(Md,2)

    Pvalue = 0.0
    do m = 1, width, TILE_WIDTH
      ! load one tile of Md and one tile of Nd into shared memory
      Mds(tx,ty) = Md(i,m+ty-1)
      Nds(tx,ty) = Nd(m+tx-1,j)
      call syncthreads()
      do k = 1, TILE_WIDTH
        Pvalue = Pvalue + Mds(tx,k)*Nds(k,ty)
      enddo
      call syncthreads()
    enddo
    Pd(i,j) = Pvalue
  end subroutine matrixMul
end module mathOps
program testMatrixMultiply
  use cudafor
  use mathOps
  implicit none
  integer, parameter :: m=4*TILE_WIDTH, n=6*TILE_WIDTH, k=2*TILE_WIDTH
  real :: a(m,k), b(k,n), c(m,n), c2(m,n)
  real, device :: a_d(m,k), b_d(k,n), c_d(m,n)
  type(dim3) :: grid, tBlock

  call random_number(a); a_d = a
  call random_number(b); b_d = b
  tBlock = dim3(TILE_WIDTH, TILE_WIDTH, 1)
  grid = dim3(m/TILE_WIDTH, n/TILE_WIDTH, 1)
  call matrixMul<<<grid, tBlock>>>(a_d, b_d, c_d)
  c = c_d

  ! test against FORTRAN 90 matmul intrinsic
  c2 = matmul(a, b)
  write(*,*) 'max error: ', maxval(abs(c-c2))
end program testMatrixMultiply
The matrixMul kernel uses shared memory tiles Mds and Nds just as in the CUDA C code; however, passing in 2D arrays as kernel arguments allows for more intuitive indexing on the global arrays Md and Nd when copying to shared memory.
In the preceding matrix multiplication, we used the FORTRAN 90 matmul intrinsic to check our results. Because of the distinction between host and device data in the host code, it is possible to build generic interfaces that overload routines to execute either on the host or on the device depending on whether the arguments are host or device data. To illustrate how this is done, we present a generic interface to the matrix multiplication example in the previous section:
module mathOps
  integer, parameter :: TILE_WIDTH = 16

  interface matrixMultiply
    module procedure mmCPU, mmGPU
  end interface matrixMultiply

contains

  function mmCPU(a, b) result(c)
    implicit none
    real :: a(:,:), b(:,:)
    real :: c(size(a,1),size(b,2))
    c = matmul(a,b)
  end function mmCPU

  function mmGPU(a_d, b_d) result(c)
    use cudafor
    implicit none
    real, device :: a_d(:,:), b_d(:,:)
    real :: c(size(a_d,1),size(b_d,2))
    real, device, allocatable :: c_d(:,:)
    integer :: m, n
    type(dim3) :: grid, tBlock
    m = size(c,1); n = size(c,2)
    allocate(c_d(m,n))
    tBlock = dim3(TILE_WIDTH, TILE_WIDTH, 1)
    grid = dim3(m/TILE_WIDTH, n/TILE_WIDTH, 1)
    call matrixMul<<<grid, tBlock>>>(a_d, b_d, c_d)
    c = c_d
    deallocate(c_d)
  end function mmGPU

  attributes(global) subroutine matrixMul(Md, Nd, Pd)
    implicit none
    real, intent(in) :: Md(:,:), Nd(:,:)
    real, intent(out) :: Pd(:,:)
    real, shared :: Mds(TILE_WIDTH, TILE_WIDTH)
    real, shared :: Nds(TILE_WIDTH, TILE_WIDTH)
    integer :: i, j, k, m, tx, ty, width
    real :: Pvalue

    tx = threadIdx%x; ty = threadIdx%y
    i = (blockIdx%x-1)*TILE_WIDTH + tx
    j = (blockIdx%y-1)*TILE_WIDTH + ty
    width = size(Md,2)

    Pvalue = 0.0
    do m = 1, width, TILE_WIDTH
      Mds(tx,ty) = Md(i,m+ty-1)
      Nds(tx,ty) = Nd(m+tx-1,j)
      call syncthreads()
      do k = 1, TILE_WIDTH
        Pvalue = Pvalue + Mds(tx,k)*Nds(k,ty)
      enddo
      call syncthreads()
    enddo
    Pd(i,j) = Pvalue
  end subroutine matrixMul
end module mathOps
program testMatrixMultiply
  use cudafor
  use mathOps
  implicit none
  integer, parameter :: m=4*TILE_WIDTH, n=6*TILE_WIDTH, k=2*TILE_WIDTH
  real :: a(m,k), b(k,n), c(m,n), c2(m,n)
  real, device :: a_d(m,k), b_d(k,n)

  call random_number(a); a_d = a
  call random_number(b); b_d = b
  c = matrixMultiply(a_d, b_d)
  c2 = matrixMultiply(a, b)
  write(*,*) 'max error: ', maxval(abs(c-c2))
end program testMatrixMultiply
The interface to matrixMultiply in this code is overloaded using two procedures defined in the module, mmCPU and mmGPU. mmCPU operates on host data and simply calls the FORTRAN 90 intrinsic matmul. mmGPU takes device data for the input matrices and returns a host array with the result. (It could just as easily have been defined to return a device array.) The device array used for the result in mmGPU, c_d, is a local array that is declared on the sixth line of mmGPU and allocated on the tenth line of that routine. After this allocation, the locally defined execution configuration parameters are determined and the kernel is launched, which is followed by a device-to-host transfer and the deallocation of c_d. The actual matrix multiply kernel is not modified from the previous section. In the host code, matrixMultiply is used to access both of these routines.
In the previous section we demonstrated how an interface can be used to allow a single call to perform operations on either the host or device depending on where the input data resides. An interface can also be used to call C or CUDA C functions from CUDA FORTRAN using the iso_c_binding module introduced in FORTRAN 2003. Such functions can either be CUDA C routines developed by the user or library routines. In our matrix multiplication code, for example, we might wish to call the CUBLAS version of SGEMM rather than our hand-coded version. This can be done in the following manner:
module cublas_m
  interface cublasInit
    integer function cublasInit() bind(C,name='cublasInit')
    end function cublasInit
  end interface

  interface cublasSgemm
    subroutine cublasSgemm(cta,ctb,m,n,k,alpha,A,lda,B,ldb,beta,C,ldc) &
        bind(C,name='cublasSgemm')
      use iso_c_binding
      character(1,c_char), value :: cta, ctb
      integer(c_int), value :: k, m, n, lda, ldb, ldc
      real(c_float), value :: alpha, beta
      real(c_float), device :: A(lda,*), B(ldb,*), C(ldc,*)
    end subroutine cublasSgemm
  end interface cublasSgemm
end module cublas_m
program sgemmDevice
  use cublas_m
  use cudafor
  implicit none
  integer, parameter :: m = 100, n = 100, k = 100
  real :: a(m,k), b(k,n), c(m,n), c2(m,n)
  real, device :: a_d(m,k), b_d(k,n), c_d(m,n)
  real, parameter :: alpha = 1.0, beta = 0.0
  integer :: lda = m, ldb = k, ldc = m
  integer :: istat

  call random_number(a); a_d = a
  call random_number(b); b_d = b
  istat = cublasInit()
  call cublasSgemm('n','n',m,n,k,alpha,a_d,lda,b_d,ldb,beta,c_d,ldc)
  c = c_d
  c2 = matmul(a,b)
  write(*,*) 'max error =', maxval(abs(c-c2))
end program sgemmDevice
Here the module cublas_m contains interfaces for the CUBLAS routines cublasInit and cublasSgemm, which are bound to C functions as dictated by the bind(C,name='…') clause. The iso_c_binding module is used in the cublasSgemm interface as this module contains the type kind parameters used in the declarations for the function arguments.
One could manually write these interfaces for all of the CUBLAS routines, but this has already been done in the cublas module provided with the PGI CUDA FORTRAN compiler. In the preceding code, one can simply remove the cublas_m module and change the use cublas_m to use cublas in the main program. The cublas module also contains generic interfaces to overload the standard BLAS functions to execute the CUBLAS versions when the array arguments are device arrays. So we can further change the preceding program to call sgemm rather than cublasSgemm. The complete program then becomes as follows:
program sgemmDevice
  use cublas
  use cudafor
  implicit none
  integer, parameter :: m = 100, n = 100, k = 100
  real :: a(m,k), b(k,n), c(m,n), c2(m,n)
  real, device :: a_d(m,k), b_d(k,n), c_d(m,n)
  real, parameter :: alpha = 1.0, beta = 0.0
  integer :: lda = m, ldb = k, ldc = m
  integer :: istat

  call random_number(a); a_d = a
  call random_number(b); b_d = b
  istat = cublasInit()
  call sgemm('n','n',m,n,k,alpha,a_d,lda,b_d,ldb,beta,c_d,ldc)
  c = c_d
  c2 = matmul(a,b)
  write(*,*) 'max error =', maxval(abs(c-c2))
end program sgemmDevice
There are many occasions when one wishes to perform simple operations on device data, such as scaling or normalization of a device array. For such operations, it can be cumbersome to write separate kernels; fortunately, CUDA FORTRAN provides kernel loop directives, or CUF kernels. CUF kernels essentially allow the programmer to inline simple kernels in host code. For example, our SAXPY code using CUF kernels becomes
program testSaxpy
  use cudafor
  implicit none
  integer, parameter :: N = 40000
  real :: x(N), y(N), a
  real, device :: x_d(N), y_d(N)
  integer :: i

  x = 1.0; x_d = x
  y = 2.0; y_d = y
  a = 2.0

  !$cuf kernel do <<<*,*>>>
  do i = 1, N
    y_d(i) = y_d(i) + a*x_d(i)
  end do

  y = y_d
  write(*,*) 'Max error: ', maxval(abs(y-4.0))
end program testSaxpy
In this complete code, the module containing the saxpy kernel has been removed, and in its place in the host code is a loop that operates on device arrays. The directive !$cuf kernel do informs the compiler to generate a kernel for the operation in the following do loop. The execution configuration can be specified manually in the <<<…,…>>>, or asterisks can be used to have the compiler choose an execution configuration, as is done in this case. CUF kernels can operate on nested loops, and can use nondefault streams.
One particularly useful aspect of CUF kernels is their ability to perform reductions. When the left side of an expression in a CUF kernel loop is a host scalar variable, a reduction operation is performed on the device. This is useful because coding a well-performing reduction in CUDA is not a trivial matter. The calculation of the sum of the device array elements using compiler-generated CUF kernels looks like the following:
program testReduction
  use cudafor
  implicit none
  integer, parameter :: N = 40000
  real :: x(N), xsum
  real, device :: x_d(N)
  integer :: i

  x = 1.0; x_d = x
  xsum = 0.0

  !$cuf kernel do <<<*,*>>>
  do i = 1, N
    xsum = xsum + x_d(i)
  end do

  write(*,*) 'Error: ', abs(xsum - sum(x))
end program testReduction
In our matrix multiplication example we demonstrated how static shared memory is used, which is essentially analogous to how it is declared in CUDA C. For dynamic shared memory, there are several options in CUDA FORTRAN. If a single dynamic shared memory array is used, then once again the CUDA FORTRAN implementation parallels what is done in CUDA C:
attributes(global) subroutine dynamicReverse1(d)
  real :: d(:)
  integer :: t, tr
  real, shared :: s(*)

  t = threadIdx%x
  tr = size(d)-t+1
  s(t) = d(t)
  call syncthreads()
  d(t) = s(tr)
end subroutine dynamicReverse1
where the shared memory array s, used to reverse the elements of a single thread block's array in this kernel, is declared as an assumed-size array. The size of this dynamic shared memory array is determined from the number of bytes of dynamic shared memory specified in the third execution configuration parameter:
threadBlock = dim3(n,1,1)
grid = dim3(1,1,1)
…
call dynamicReverse1<<<grid,threadBlock,4*threadBlock%x>>>(d_d)
When multiple dynamic shared memory arrays are used in CUDA C, essentially one large block of memory is allocated and pointer arithmetic is used to determine offsets into this block for the various variables. In CUDA FORTRAN, automatic arrays are used:
attributes(global) subroutine dynamicReverse2(d, nSize)
  real :: d(nSize)
  integer, value :: nSize
  integer :: t, tr
  real, shared :: s(nSize)

  t = threadIdx%x
  tr = nSize-t+1
  s(t) = d(t)
  call syncthreads()
  d(t) = s(tr)
end subroutine dynamicReverse2
Here nSize is not known at compile time, hence s is not a static shared memory array. Any in-scope variable, such as a variable declared in the module that contains this kernel, can be used to determine the size of the automatic shared memory arrays. Multiple dynamic shared memory arrays of different types can be specified in this fashion. The total amount of dynamic shared memory must still be specified in the third execution configuration parameter.
Asynchronous data transfers are performed using the cudaMemcpy*Async() API calls as is done in CUDA C, with a couple of differences that apply not only to these asynchronous data transfer API calls but also to the synchronous cudaMemcpy*() variants. The first difference is that the size of the transfer specified in the third argument is in terms of the number of elements rather than the number of bytes, and the second is that the direction of transfer is an optional argument, since the direction can be inferred from the types of the first two arguments.
As with CUDA C, for asynchronous transfers the host memory must be pinned, which is accomplished through the pinned variable attribute rather than through a specific allocation function. Pinned memory in CUDA FORTRAN must be allocatable, and can be allocated and de-allocated through the FORTRAN 90 allocate() and deallocate() statements.
To overlap kernel execution and data transfers, in addition to pinned host memory, the data transfer and kernel must use different, nondefault streams. Nondefault streams are required for this overlap because memory copy, memory set functions, and kernel calls that use the default stream begin only after all preceding calls on the device (in any stream) have completed, and no operation on the device (in any stream) commences until they are finished. The following is an example of overlapping kernel execution and data transfer:
real, allocatable, pinned :: a(:)
…
integer(kind=cuda_stream_kind) :: stream1, stream2
…
allocate(a(nElements))
istat = cudaStreamCreate(stream1)
istat = cudaStreamCreate(stream2)
istat = cudaMemcpyAsync(a_d, a, nElements, stream1)
call kernel<<<gridSize,blockSize,0,stream2>>>(b_d)
In this example, two streams are created and used in the data transfer and kernel executions as specified in the last argument of the cudaMemcpyAsync() call and the kernel's execution configuration. We make use of two device arrays, a_d and b_d, and assign work on a_d to stream1 and work on b_d to stream2.
If the operations on a single data array in a kernel are independent, then data can be broken into chunks and transferred in multiple stages, multiple kernels launched to operate on each chunk as it arrives, and each chunk's results transferred back to the host when the relevant kernel completes. The following code segments demonstrate two ways of breaking up data transfers and kernel work to hide transfer time:
! baseline case - sequential transfer and execute
a = 0
istat = cudaEventRecord(startEvent, 0)
a_d = a
call kernel<<<n/blockSize, blockSize>>>(a_d, 0)
a = a_d
istat = cudaEventRecord(stopEvent, 0)

! Setup for multiple stream processing
strSize = n / nStreams
strGridSize = strSize / blockSize
do i = 1, nStreams
  istat = cudaStreamCreate(stream(i))
enddo

! asynchronous version 1: loop over {copy, kernel, copy}
a = 0
istat = cudaEventRecord(startEvent, 0)
do i = 1, nStreams
  offset = (i-1)*strSize
  istat = cudaMemcpyAsync(a_d(offset+1), a(offset+1), strSize, stream(i))
  call kernel<<<strGridSize, blockSize, 0, stream(i)>>>(a_d, offset)
  istat = cudaMemcpyAsync(a(offset+1), a_d(offset+1), strSize, stream(i))
enddo
istat = cudaEventRecord(stopEvent, 0)

! asynchronous version 2:
! loop over copy, loop over kernel, loop over copy
a = 0
istat = cudaEventRecord(startEvent, 0)
do i = 1, nStreams
  offset = (i-1)*strSize
  istat = cudaMemcpyAsync(a_d(offset+1), a(offset+1), strSize, stream(i))
enddo
do i = 1, nStreams
  offset = (i-1)*strSize
  call kernel<<<strGridSize, blockSize, 0, stream(i)>>>(a_d, offset)
enddo
do i = 1, nStreams
  offset = (i-1)*strSize
  istat = cudaMemcpyAsync(a(offset+1), a_d(offset+1), strSize, stream(i))
enddo
istat = cudaEventRecord(stopEvent, 0)
The asynchronous cases are similar to the sequential case, except that there are multiple data transfers and kernel launches, distinguished by different streams and by an offset corresponding to the particular stream. In this code, we limit the number of streams to four, although for large arrays there is no reason why a larger number of streams could not be used. Note that the same kernel is used in the sequential and asynchronous cases in the code, as an offset is sent to the kernel to accommodate the data in different streams. The difference between the two asynchronous versions is the order in which the copies and kernels are executed. The first version loops over the streams, and for each stream issues a host-to-device copy, a kernel launch, and a device-to-host copy. The second version issues all host-to-device copies, then all kernel launches, and then all device-to-host copies. We also make use of a third approach, a variant of the second in which a dummy event is recorded after each kernel launch:
do i = 1, nStreams
  offset = (i-1)*strSize
  call kernel<<<strGridSize, blockSize, 0, stream(i)>>>(a_d, offset)
  ! Add a dummy event
  istat = cudaEventRecord(dummyEvent, stream(i))
enddo
At this point you may be asking why we have three versions of the asynchronous case. The reason is that these variants perform differently on different hardware. Running this code on the NVIDIA Tesla C1060 produces the following:
Device: Tesla C1060
Time for sequential transfer and execute (ms): 12.92381
Time for asynchronous V1 transfer and execute (ms): 13.63690
Time for asynchronous V2 transfer and execute (ms): 8.845888
Time for asynchronous V3 transfer and execute (ms): 8.998560
And on the NVIDIA Tesla C2050 we get the following:
Device: Tesla C2050
Time for sequential transfer and execute (ms): 9.984512
Time for asynchronous V1 transfer and execute (ms): 5.735584
Time for asynchronous V2 transfer and execute (ms): 7.597984
Time for asynchronous V3 transfer and execute (ms): 5.735424
To decipher these results we need to understand a bit more about how devices schedule and execute various tasks. CUDA devices contain engines for various tasks, and operations are queued up in these engines as they are issued. Dependencies between tasks in different engines are maintained, but within any engine all dependence is lost, as tasks in an engine's queue are executed in the order they are issued by the host thread. For example, the C1060 has a single copy engine and a single kernel engine. For the preceding code, timelines for the execution on the device are schematically shown in Figure 17.1. In this schematic we have assumed that the times required for the host-to-device transfer, kernel execution, and device-to-host transfer are approximately the same, and in the code provided, a kernel was chosen to make these times comparable.
Figure 17.1 Data transfer and kernel execution timing for the sequential and asynchronous versions when there is only one copy engine.
For the sequential case, there is no overlap in any of the operations, as one would expect. For the first asynchronous version of our code the order of execution in the copy engine is: H2D stream(1), D2H stream(1), H2D stream(2), D2H stream(2), and so forth. This is why we do not see any speedup when using the first asynchronous version on the C1060: tasks were issued to the copy engine in an order that precludes any overlap of kernel execution and data transfer. For versions two and three, however, where all the host-to-device transfers are issued before any of the device-to-host transfers, overlap is possible as indicated by the lower execution time. From our schematic, we would expect the execution of versions two and three to be 8/12 of the sequential version, or 8.7 ms, which is what is observed in the timing in the code.
On the C2050, two features interact to cause different behavior than that observed on the C1060. The C2050 has two copy engines, one for host-to-device transfers and another for device-to-host transfers, in addition to a single kernel engine. Having two copy engines explains why the first asynchronous version achieves good speedup on the C2050: the device-to-host transfer of data in stream(i) does not block the host-to-device transfer of data in stream(i+1) as it did on the C1060, because these two operations are in different engines on the C2050, as schematically shown in Figure 17.2.
Figure 17.2 Data transfer and kernel execution timing for the sequential and asynchronous versions when there are two copy engines.
From the schematic we would expect the execution time to be cut in half relative to the sequential version, which is roughly what is observed in the timings in the code. This does not explain the performance degradation observed in the second asynchronous approach, however, which is related to the C2050's support for running multiple kernels concurrently. When multiple kernels are issued back-to-back, the scheduler tries to enable concurrent execution of these kernels, and as a result delays a signal that normally occurs after each kernel completion (and is responsible for kicking off the device-to-host transfer) until all kernels complete. So, while there is overlap between host-to-device transfers and kernel execution in the second version of our asynchronous code, there is no overlap between kernel execution and device-to-host transfers. From Figure 17.2 one would expect the overall time for the second asynchronous version to be 9/12 of the time for the sequential version, or 7.5 ms, which is what we observe from the timings in the code. This situation can be rectified by recording a dummy CUDA event between kernels, which inhibits concurrent kernel execution but enables overlap of data transfers and kernel execution, as is done in the third asynchronous version.
CUDA FORTRAN codes are compiled using the PGI FORTRAN compiler. Files with the .cuf or .CUF extension have CUDA FORTRAN enabled automatically, and the compiler option -Mcuda can be used to enable CUDA FORTRAN when compiling files with other extensions. Compilation of CUDA FORTRAN code can be as simple as issuing the command
pgf90 saxpy.cuf
Behind the scenes, a multistep process takes place. The first step is a source-to-source compilation in which CUDA C device code is generated from the CUDA FORTRAN. From there, compilation is similar to that of CUDA C: the device code is compiled into the intermediate representation PTX, and the PTX code is then further compiled into executable code for a particular compute capability. The host code is compiled using pgFORTRAN. The final executable contains the host binary, the device binary, and the PTX. The PTX is included so that a new device binary can be created when the executable is run on a card of a different compute capability than the one originally compiled for.
Specifics of this compilation process can be controlled through options to -Mcuda. A specific compute capability can be targeted; for example, -Mcuda=cc20 generates executables for devices of compute capability 2.0. There is an emulation mode, specified by -Mcuda=emu, in which device code is run on the host. The specific version of the CUDA toolkit can be chosen; for example, -Mcuda=cuda4.0 causes compilation with the 4.0 CUDA toolkit. CUDA has a set of fast, but less accurate, intrinsics for single-precision functions like sin() and cos(), which can be enabled by the -Mcuda=fastmath option. Use of these functions requires no change in the CUDA FORTRAN source code, as the intermediate CUDA C code will be generated with the corresponding __sinf() and __cosf() functions, respectively. For finer (selective) control, these versions are available when the cudadevice module is used in the device code. The option -Mcuda=maxregcount:N can be used to limit the number of registers used per thread to N, and the option -Mcuda=ptxinfo prints information on memory usage in kernels. Multiple options to -Mcuda can be given in a comma-separated list, for example, -Mcuda=cc20,cuda4.0,ptxinfo.
Profiling CUDA FORTRAN codes can be performed using the command-line profiling facility used in CUDA C. Setting the environment variable COMPUTE_PROFILE to 1,
% export COMPUTE_PROFILE=1
and executing the code generates a file of profiling results, by default cuda_profile_0.log. For use of the command-line profiler see the documentation distributed with the CUDA toolkit.
Previously, we demonstrated calling external CUDA C libraries from CUDA FORTRAN, in particular the CUBLAS library, using the iso_c_binding module. In this section we demonstrate how CUDA FORTRAN can interface with Thrust, the standard template library for the GPU discussed in Chapter 16. Relative to calling CUDA C functions, interfacing with Thrust requires the additional step of creating C pointers that access the Thrust device containers, as in the following code segment:
// allocate device vector
thrust::device_vector<int> d_vec(4);
// obtain raw pointer to device vector's memory
int *ptr = thrust::raw_pointer_cast(&d_vec[0]);
The basic procedure to interface Thrust with CUDA FORTRAN is to create C wrapper functions that access Thrust's functions through standard C pointers, and then use the iso_c_binding module to access these functions through a generic interface in CUDA FORTRAN. As an example, we use Thrust's sort routine. The wrapper functions for the int, float, and double sort routines are as follows:
// Filename: csort.cu
// nvcc -c -arch sm_20 csort.cu
#include <thrust/device_vector.h>
#include <thrust/sort.h>

extern "C" {

  // Sort for integer arrays
  void sort_int_wrapper(int *data, int N)
  {
    // Wrap raw pointer with a device_ptr
    thrust::device_ptr<int> dev_ptr(data);
    // Use device_ptr in Thrust sort algorithm
    thrust::sort(dev_ptr, dev_ptr+N);
  }

  // Sort for float arrays
  void sort_float_wrapper(float *data, int N)
  {
    thrust::device_ptr<float> dev_ptr(data);
    thrust::sort(dev_ptr, dev_ptr+N);
  }

  // Sort for double arrays
  void sort_double_wrapper(double *data, int N)
  {
    thrust::device_ptr<double> dev_ptr(data);
    thrust::sort(dev_ptr, dev_ptr+N);
  }

}
Compiling the code using
nvcc -c -arch sm_20 csort.cu
will generate an object file, csort.o, that we will use later on in the linking stage of the CUDA FORTRAN code.
With the C wrapper functions available, we can now write a FORTRAN module with a generic interface to Thrust's sort functionality:
module thrust
interface thrustsort
  subroutine sort_int(input, N) bind(C, name="sort_int_wrapper")
    use iso_c_binding
    integer(c_int), device :: input(*)
    integer(c_int), value :: N
  end subroutine sort_int
  subroutine sort_float(input, N) bind(C, name="sort_float_wrapper")
    use iso_c_binding
    real(c_float), device :: input(*)
    integer(c_int), value :: N
  end subroutine sort_float
  subroutine sort_double(input, N) bind(C, name="sort_double_wrapper")
    use iso_c_binding
    real(c_double), device :: input(*)
    integer(c_int), value :: N
  end subroutine sort_double
end interface thrustsort
end module thrust
With the C wrapper functions and the FORTRAN module written, we can now turn to the main FORTRAN code, which generates and transfers the data to the device, calls the sort function, and transfers the data back to the host:
program testsort
  use thrust
  ! Declare two arrays, one on the CPU (cpuData), one on the GPU (gpuData)
  real, allocatable :: cpuData(:)
  real, allocatable, device :: gpuData(:)
  integer :: N=10
  ! Allocate the arrays using standard allocate
  allocate(cpuData(N), gpuData(N))
  ! Generate random numbers on the CPU
  do i=1,N
    cpuData(i)=random(i)
  end do
  cpuData(5)=100.
  print *,"Before sorting", cpuData
  ! Copy the data to the GPU with a simple assignment
  gpuData=cpuData
  ! Call the Thrust sorting function. The generic interface will
  ! select the proper routine, in this case the one operating on floats
  call thrustsort(gpuData, size(gpuData))
  ! Copy the data back to the CPU with a simple assignment
  cpuData=gpuData
  print *,"After sorting", cpuData
  ! Deallocate the arrays using standard deallocate
  deallocate(cpuData, gpuData)
end program testsort
If we save the module in a file mod_thrust.cuf and the program in simple_sort.cuf, we are ready to compile and execute:
$ pgf90 -Mcuda=cc20 -O3 -o simple_sort mod_thrust.cuf simple_sort.cuf csort.o
$ ./simple_sort
Before sorting 4.1630346E-02 0.9124327 0.7832350 0.6540373
100.0000 0.3956419 0.2664442 0.1372465
After sorting 8.0488138E-03 4.1630346E-02 0.1372465 0.2664442
0.3956419 0.6540373 0.7832350 0.8788511
0.9124327 100.0000
We can modify the main code to evaluate the performance using the CUDA event API as follows:
program timesort
  use cudafor
  use thrust
  implicit none
  real, allocatable :: cpuData(:)
  real, allocatable, device :: gpuData(:)
  integer :: i, N=100000000
  ! CUDA events for timing
  type (cudaEvent) :: startEvent, stopEvent
  real :: time, random
  integer :: istat
  ! Create events
  istat = cudaEventCreate(startEvent)
  istat = cudaEventCreate(stopEvent)
  ! Allocate arrays
  allocate(cpuData(N), gpuData(N))
  do i=1,N
    cpuData(i)=random(i)
  end do
  print *,"Sorting array of ",N," single precision"
  gpuData=cpuData
  istat = cudaEventRecord(startEvent, 0)
  call thrustsort(gpuData, size(gpuData))
  istat = cudaEventRecord(stopEvent, 0)
  istat = cudaEventSynchronize(stopEvent)
  istat = cudaEventElapsedTime(time, startEvent, stopEvent)
  cpuData=gpuData
  print *," Sorted array in:",time," (ms)"
  ! Print the first five elements and the last five.
  print *,"After sorting", cpuData(1:5), cpuData(N-4:N)
end program timesort
With the CUDA events, we are timing only the execution time of the sorting kernel. We can sort a vector of 100 M elements in 0.222 seconds on a Tesla M2050 with ECC on, when the data is resident in GPU memory:
$ pgf90 -Mcuda=cc20 -O3 -o time_sort mod_thrust.cuf time_sort.cuf csort.o
$ ./time_sort
Sorting array of 100000000 single precision
Sorted array in: 222.1711 (ms)
After sorting 7.0585919E-09 1.0318221E-08 1.9398616E-08 3.1738640E-08
4.4078664E-08 0.9999999 0.9999999 1.000000 1.000000 1.000000
18.1 Core C++ AMP Features
18.2 Details of the C++ AMP Execution Model
18.3 Managing Accelerators
18.4 Tiled Execution
18.5 C++ AMP Graphics Features
18.6 Summary
18.7 Exercises
C++ Accelerated Massive Parallelism, or C++ AMP, is a programming model for expressing data-parallel algorithms and exploiting heterogeneous computers using mainstream tools. C++ AMP was designed to offer productivity, portability, and performance. Developed initially by Microsoft, C++ AMP is defined by an open specification that takes input from multiple sources, including AMD and NVIDIA. In this chapter we provide an overview of C++ AMP.
The focus of C++ AMP is to express the important data-parallel algorithm patterns while providing a minimum of new language features and shielding common scenarios from the intricacies of today's GPU programming. This provides a foundation of portability for applications written in C++ AMP across a range of different hardware. This portability future-proofs the investment as hardware continues to evolve, and improves the reusability of code across different devices and different manufacturers. At the same time, the full C++ AMP feature set includes advanced mechanisms for achieving performance when system intricacies must be addressed. In this chapter, we first discuss the most straightforward examples of C++ AMP, and then more briefly address these advanced features.
C++ AMP is a small extension to the current C++ 11 standard and is dependent on some of the core features of that standard. In particular, we will assume readers are familiar with modern C++, including the use of lambda expressions to build function closures, the use of templates for type-generic programming, the use of namespaces to control visibility of names, and the standard template library (STL). The common patterns are simple, so a deep understanding is not a prerequisite to use C++ AMP. Unlike CUDA and OpenCL, C++ AMP allows a rich subset of C++ inside data-parallel computations as well as using C++ for the host. C++ AMP has the same base compilation model as C++, with header files for interface specification and separate compilation units combined into a single executable.
C++ AMP does rely on two extensions to the language. The first places restrictions on the C++ operations that may be used in bodies of functions, and the second supports a form of limited cross-thread data sharing within data-parallel kernels. Both of these will be illustrated in Section 18.1. All other aspects of C++ AMP are delivered as a library accessed via a few header files.
C++ AMP shares many concepts with CUDA. In the following text we will illustrate this by showing C++ AMP equivalents for CUDA examples from earlier chapters. C++ AMP terminology differs from CUDA in small ways and we will highlight those differences as they arise.
We describe the core features of C++ AMP by translating an example used in Chapter 3 from CUDA into C++ AMP. Figure 18.1 is the CUDA code for performing vector addition on host vectors using a CUDA device.
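Figure 18.1 is not reproduced here; the following is a minimal sketch of such a CUDA vector addition (the line numbers cited in the surrounding text refer to the original figure, not to this sketch):

__global__ void vecAddKernel(const float* A, const float* B, float* C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) C[i] = A[i] + B[i];
}

void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;
    cudaMalloc((void**)&A_d, size);                  // device allocations
    cudaMalloc((void**)&B_d, size);
    cudaMalloc((void**)&C_d, size);
    cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);
    vecAddKernel<<<(n + 255)/256, 256>>>(A_d, B_d, C_d, n);
    cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);
    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
}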
The corresponding C++ AMP code is shown in Figure 18.2. Line 1 includes the C++ AMP header, amp.h, which provides the declarations of the core features. The C++ AMP classes and functions are part of the concurrency namespace. The using directive on the next line makes the C++ AMP names visible in the current scope. It is optional but avoids the need to prefix C++ AMP names with a concurrency:: scope specifier.
Figure 18.2 Vector addition in C++ AMP.
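Figure 18.2 itself is not reproduced here; the following is a minimal sketch consistent with the description that follows, using the names the text cites (line numbers in the text refer to the original figure):

#include <amp.h>
using namespace concurrency;

void vecAdd(float* A, float* B, float* C, int n)
{
    array_view<const float, 1> AV(n, A), BV(n, B);  // views over existing host data
    array_view<float, 1> CV(n, C);
    CV.discard_data();                              // current contents of C are immaterial
    parallel_for_each(CV.extent, [=](index<1> i) restrict(amp)
    {
        CV[i] = AV[i] + BV[i];
    });
    CV.synchronize();                               // make results visible through C
}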
The function vecAdd on line 4 in Figure 18.2 is functionally identical to the same function starting on line 6 in Figure 18.1. This function is executed by a thread running on the host and it contains a data-parallel computation that may be accelerated. The term host has the same meaning in C++ AMP documentation as in CUDA. While CUDA uses the term device to refer to the execution environment used for accelerated execution, C++ AMP uses the term accelerator, which is discussed further in Section 18.3.
In C++ AMP, the primary vehicle for reading and writing large data collections is the class template array_view. An array_view provides a multidimensional reference to a rectangular collection of data locations. This is not a new copy of the data but rather a new way to access the existing memory locations. The template has two parameters: the type of the elements of the source data, and an integer that indicates the dimensionality of the array_view. Throughout C++ AMP, template parameters that indicate dimensionality are referred to as the rank of the type or object. In this example, we have a 1D array_view (or an array_view of rank 1) of C++ float values.
The constructor for array views of rank 1, such as CV on line 7 in Figure 18.2, takes two parameters. The first is an integer value that is the number of data elements. In general, the set of per-dimension lengths is referred to as an extent. To represent and manipulate extents, C++ AMP provides a class template, extent, with a single-integer template parameter that captures the rank. For objects with a low number of dimensions, various constructors are overloaded to allow specification of an extent as one or more integer values, as is done for CV. The second parameter to the CV constructor is a pointer to the host data. In vecAdd the host data is expressed as a C-style pointer to contiguous data. An array_view may also overlay STL containers (see Section 16.1) such as std::vector when they support a data method to access underlying contiguous storage.
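For instance, a sketch of overlaying a std::vector (the function name is illustrative):

#include <amp.h>
#include <vector>
using namespace concurrency;

void overlay_example(int n)
{
    std::vector<float> v(n);           // contiguous storage exposing a data() method
    array_view<float, 1> av(n, v);     // overlays the vector; no copy is made here
}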
The CUDA code explicitly allocates memory (Figure 18.1, lines 9-13) that is accessible by the device and copies host data into it. These actions are implicit in C++ AMP: they follow from creating the association between an array_view and host data and subsequently accessing the data through the array_view on the accelerator. The method array_view::discard_data optimizes data transfers for some accelerators and is discussed in the next section. In this example, it is used when existing data values are immaterial because they are about to be overwritten.
Line 9 in Figure 18.2 illustrates the parallel_for_each construct, which is the C++ AMP code pattern for a data-parallel computation. This corresponds to the kernel launch in CUDA (Figure 18.1, line 14). In CUDA terminology (as in Figure 3.3), the parallel_for_each creates a grid of threads. In C++ AMP the set of elements for which a computation is performed is called the compute domain and is defined by an extent object. As in CUDA, each thread will invoke the same function for every point, and threads are distinguished only by their location in the domain (grid). Unlike CUDA, this domain need not be treated as an array of thread blocks (as in Figure 3.12). The index parameter combines the information needed for common cases from the separate CUDA keywords blockIdx.x, blockDim.x, and threadIdx.x.
Similar to the standard C++ STL algorithm for_each, the parallel_for_each function template specifies a function to be applied to a collection of values. The first argument to a parallel_for_each is a C++ AMP extent object that describes the domain over which a data-parallel computation is performed. In this example, we perform an operation over every element in an array_view, and so the extent passed into the parallel_for_each is the extent of the CV array view. In the example, this is accessed through the extent property of the array_view type. This is a 1D extent, and the compute domain consists of the integer values 0..(n − 1).
The second argument to a parallel_for_each is a C++ function object (or functor). In these examples we use the C++ 11 lambda syntax as a convenient way to build such an object. The core semantics of a parallel_for_each is to invoke the function defined by the second parameter exactly once for every element in the compute domain defined by the extent argument.
The leading [=] indicates that variables declared inside the containing function but referenced inside the lambda are "captured" and copied into data members of the function object built for the lambda. In this case, these are the three array_view objects. The function invoked has a single parameter that is initialized to the location of a thread within the compute domain. This is again represented by a class template, index, which represents a short vector of integer values. The rank of an index is the length of this vector and is the same as the rank of the extent. The index parameter conveys the same information as the explicitly computed value i in the CUDA code (see Figure 18.1, line 3). These index values can be used to select elements in an array view, as illustrated on line 11 of Figure 18.2.
A key extension to C++ is shown in this example: the restrict(amp) modifier. In C++ AMP, the existing C99 keyword restrict is borrowed and allowed in a new context: it may trail the formal parameter list of a function (including lambda functions). The restrict keyword is then followed by a parenthesized list of one or more restriction specifiers. While other uses are possible, in C++ AMP there are only two such specifiers defined: amp and cpu.
The function object passed to parallel_for_each must have its call operator annotated with a restrict(amp) specification. Any function called from the body of that operator must similarly be restricted. The restrict(amp) specification is analogous to the __device__ keyword in CUDA. It identifies functions that may be invoked on a hardware accelerator. Analogously, restrict(cpu) corresponds to the CUDA __host__ keyword and indicates functions that may be invoked on the host. When no restriction is specified, the default is restrict(cpu). C++ AMP has no need for an analog to the CUDA __global__ keyword. A function may have both restrictions, restrict(cpu,amp), in which case it may be called in either host or accelerator contexts and must satisfy the restrictions of both contexts.
The restrict modifier allows a subset of C++ to be defined for use in a body of code. In the first release of C++ AMP, the restrictions reflect current common limitations of GPUs when used as accelerators of data-parallel code. The set of restrictions includes:
• No reference may be made to global or static variables except when they have a const type qualification and can be reduced to an integer literal value that is only used as an rvalue.
• A lambda expression used in a parallel_for_each must capture most variables by value, with the exception of C++ AMP array and texture objects, each described later.
• Targets of function calls may not be virtual methods, pointers to functions, or pointers to member functions.
• Functions may not be recursively invoked and must be inlineable.
• Only bool, int, unsigned int, long, unsigned long, float, double, and void may be used as C++ primitive types.
• C++ compound user-defined types are generally permitted but may not have virtual base classes or bit fields, and all data members and base classes must be 4-byte aligned.
• No use of dynamic_cast or typeid is permitted.
• No use of goto statements is permitted.
These restrictions reflect a common set of limitations for the GPU-based accelerators broadly available today. Over time we expect these restrictions to be lifted, and the open specification for C++ AMP includes a possible roadmap of future versions that are less restrictive. The restrict(cpu) specifier, of course, permits all of the capabilities of C++; however, because some functions that are part of C++ AMP are accelerator-specific, they do not have restrict(cpu) versions and so may only be used in restrict(amp) code.
The restriction specifiers for a function are part of the type of the function, and function names may be overloaded when they have different restrictions. Thus, two functions may have identical signatures except that one has the restrict(amp) specification and the other has the restrict(cpu) specification. This allows context-specific implementations of functions to be created. A function that has two overloads, one for each context, may be called from a restrict(amp,cpu) function, and the appropriate overload will be invoked corresponding to whether the function is being invoked on the host or on an accelerator. In particular, this capability is used within C++ AMP to allow context-specific implementations of mathematical operations, but it is also available to application and library developers.
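A brief sketch of such overloading (the function names are hypothetical; fast_math::rsqrt is declared in amp_math.h):

#include <amp.h>
#include <amp_math.h>
#include <cmath>
using namespace concurrency;

// Two overloads with identical signatures, distinguished by restriction.
float rsqrt_approx(float x) restrict(cpu) { return 1.0f / std::sqrt(x); }
float rsqrt_approx(float x) restrict(amp) { return fast_math::rsqrt(x); }

// Callable from host or accelerator; the matching overload is selected.
float scaled(float x) restrict(cpu, amp) { return 2.0f * rsqrt_approx(x); }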
Inside the body of the restrict(amp) lambda (Figure 18.2, lines 10-12), there are references to the array_view objects declared in the containing scope. These are "captured" into the function object that is created to implement the lambda. Other variables from the function scope may also be captured by value. Each of these other values is made available to each invocation of the function executed on the accelerator. As for any C++ 11 nonmutable lambda, variables captured by value may not be modified in the body of the lambda. However, the elements of an array_view may be modified, and those modifications will be reflected back to the host. In this example, any changes to CV made inside the parallel_for_each will be reflected in the host data C before the function vecAdd returns.
The final statement on line 13 in Figure 18.2 uses the array_view::synchronize method to ensure the underlying host data structure is updated with any changes. This is also discussed in the next section. This operation is not needed if the host accesses the data through the array view CV, but it is needed to reliably access the data through the host pointer C. The central purpose of the array_view is to allow coherent access to data from both the host and the accelerator without the need for explicit synchronization or data copies.
Figure 18.3 is a more complex example borrowed from Chapter 12 and Figure 12.3. It performs a calculation on a slice of a 3D data structure. We use it to illustrate the handling of higher-dimensional array_view objects and compute domains. The function interface is essentially identical to the source, with the CUDA dim3 type replaced by a C++ AMP extent<3> for the grid parameter. The contiguous data pointed to by energygrid is overlaid with a 3D array_view (named energygrid_view). C++ AMP follows a row-major storage layout, so higher-numbered dimensions are less significant in the linear storage order. C++ AMP has mechanisms to create an array_view that is a section of another array_view and also to project down to select a lower-dimensional slice. This operation is used on line 6 of Figure 18.3 to select the portion of the data actually defined by the kernel. As before, we use the discard_data method to avoid copying the immaterial existing values to the GPU. We overlay the atoms data with the 2D array_view named atom_view to simplify the expression of the accesses. This does not fundamentally change how the actual addressing arithmetic is performed, but it seems to model the problem more accurately.
Figure 18.3 Base coulomb potential calculation code for a 2D slice.
The data-parallel computation is then over the extent of the slice, where the original sequential loop indices j and i are translated into the index<2> ji. Except for the indexing of atom_view, and the indexing into energy_slice, the body of the loop is largely unchanged.
C++ AMP provides a set of basic math operations for use in restrict(amp) contexts. These functions are accessed by including amp_math.h (which is not shown). The concurrency::fast_math and concurrency::precise_math namespaces respectively declare faster and more precise versions of functions. In the example, we chose to use precise_math::sqrtf for illustration. In restrict(cpu) code, both of these namespaces establish aliases to the std:: implementations of these functions, so a function that is declared restrict(cpu,amp) can still reference math functions and get the best implementation for the target.
To summarize this section, the core C++ AMP concepts include an array_view, which provides a multidimensional view into rectangular data; an extent, which is the shape of such a view and also the shape of a data-parallel computation; an index, which is used to select elements of an array_view or a data-parallel computation; the parallel_for_each, which launches a data-parallel computation; and restrict(amp)-modified functions, which are evaluated at each point in that computation.
The core C++ AMP features noted in the previous section focus on expressing data parallelism essentially as a concurrent invocation of a collection of threads that access multidimensional arrays of data. Many accelerators today operate on a separate memory and cannot directly access host data. Furthermore, these accelerators run concurrently with the continuing execution of host code. C++ AMP minimizes the impact of these concerns, but they remain part of its execution model.
C++ AMP provides the class template array to allocate storage on an accelerator. Similar to an array_view and with a nearly identical interface, an array has element type and rank template parameters. The constructor includes extent information. Unlike an array_view, an array allocates new storage on an accelerator. The data elements of an array may only be accessed from that accelerator, and all operations that copy data between an array and host memory are explicit.
To illustrate this, consider Figure 18.4, which rewrites Figure 18.2 to use explicit array operations. Each array_view is replaced with an array declaration of the same extent. Lines 5 and 6 show explicit copies from host data to an array using the C++ AMP copy function template. The lambda is changed slightly to capture array variables by reference rather than the default mode of capturing variables by value as in the other examples. C++ AMP array objects must be captured by reference, while array_view objects must be captured by value, for the lambda used in a parallel_for_each. Line 12 specifies the data to be copied back to the host after completion of the computation.
Figure 18.4 Explicit memory and copy management.
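Figure 18.4 is likewise not reproduced; a minimal sketch consistent with its description (array names are assumptions):

#include <amp.h>
using namespace concurrency;

void vecAdd(float* A, float* B, float* C, int n)
{
    array<float, 1> AA(n), BA(n), CA(n);   // storage allocated on the accelerator
    copy(A, AA);                           // explicit host-to-accelerator copies
    copy(B, BA);
    parallel_for_each(CA.extent, [&](index<1> i) restrict(amp)
    {
        CA[i] = AA[i] + BA[i];             // arrays are captured by reference
    });
    copy(CA, C);                           // explicit copy back to host memory
}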
On an accelerator that cannot access host memory, all of the operations in Figure 18.4 also happen for the code in Figure 18.2, but they are performed transparently, either when the parallel_for_each is launched or when array_view::synchronize is called. The intended use of the explicit mechanisms is to provide more control of memory management and to allow copy operations to be initiated earlier and overlapped with other computations (although overlapped copies can be achieved through other means).
When an array_view overlays storage on the host but is accessed on the accelerator, the data is copied to an unnamed array on that accelerator and the access is made to that array. This copy of the host data may persist for the remainder of the lifetime of the array_view. This allows the C++ AMP runtime to avoid redundant copies of the same data to the accelerator. C++ AMP provides operations to influence how and when data is copied between these implicit copies and the source storage. Line 8 of Figure 18.2 shows the use of array_view::discard_data. This method is an assertion that the values stored in the host storage are immaterial, for example, because they are about to be overwritten. The effect of this assertion is that when the array_view is subsequently used in a parallel_for_each, no copy is performed from the source data to the implicit array created for accelerator access.
When an unnamed array is created to hold a copy of data associated with an array_view, and that array may be modified, the C++ AMP runtime system is permitted to copy the values back to the host storage immediately or to leave them on the accelerator. If the array_view is destructed or an element is accessed on the host, then values will be copied promptly to make sure host accesses get the most recent definition. The method array_view::synchronize is available to force any such copies to be performed by a particular program point. The method array_view::refresh indicates to the C++ AMP runtime that all cached copies of the host data should be discarded. Generally, this method is used when the underlying host data is modified directly without access through the array_view. This coherence between implicit cached copies and the underlying host data is the responsibility of the programmer.
An array_view may also refer to an array. This allows data allocated on an accelerator to be accessed by the host. Again, where necessary, this may involve creating copies of the data that are accessible by the host. The copies of data values between the source storage on the accelerator and the copies on the host are controlled using the same mechanisms and functions as before.
Most C++ AMP operations that initiate work on an accelerator, including operations to copy data to the accelerator, are asynchronous. This means that the host operation returns, and the host thread continues to the next statement, before the work completes. We illustrate this in Figure 18.5, which shows three strands of concurrent activity where time logically flows from the top to the bottom of the figure. On the left is the sequence of host operations that initiate accelerator operations. In the middle, we indicate three copy operations that each take some duration. On the right, we show the actual data-parallel computation that begins after the two copies to the accelerator complete and finishes before the final copy back to the host begins. On the host, the final copy-out is called before the data is ready, and that operation blocks until the copy completes. When it returns, the return statement executes and the function returns with updated host data.
Figure 18.5 Concurrent host/accelerator execution.
To provide finer-grain notification of which operation on the accelerator is complete, C++ AMP provides the completion_future class. This class is analogous to std::shared_future, the C++ standard method for coordination with asynchronous operations. In particular, it provides the completion_future::get method, which blocks the calling thread until the asynchronous operation completes. C++ AMP has variants of the methods discussed here that are nonblocking and return a completion_future. In particular, there are array_view::synchronize_async and various overloads of copy_async. These initiate the implied data transfer and return a synchronization object immediately, rather than blocking the thread until the operation has completed. Figure 18.6 provides a simple illustration where we assume that following the vector add computation there is some other computation involving the unmodified host data A and B. Upon completion of that other processing, the host then waits for the results from the parallel_for_each to be available on the host by using the completion_future::get call on the object returned from the array_view::synchronize_async method. After the get call returns, the host vector C will hold the results.
Figure 18.6 Overlapped accelerator and host processing.
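A sketch of the pattern described for Figure 18.6, continuing the vecAdd sketch above (do_other_host_work stands in for the unrelated host computation and is hypothetical):

// Launch the data-parallel work; the call returns without waiting.
parallel_for_each(CV.extent, [=](index<1> i) restrict(amp)
{
    CV[i] = AV[i] + BV[i];
});

// Begin the copy back to host storage without blocking this thread.
completion_future done = CV.synchronize_async();
do_other_host_work(A, B);   // hypothetical overlapped host computation on A and B
done.get();                 // blocks until C holds the results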
As discussed in Chapter 3, CUDA has an explicit notion of global memory, which is accessible by all threads in a kernel. In C++ AMP this concept is only available by having array objects associated with an accelerator. C++ AMP does not provide a facility for having file-scope objects accessible by functions running on the accelerator, the way CUDA interprets __device__ as a qualification on file-scope object declarations. Similarly, C++ AMP does not expose a concept of constant memory, although values captured in the top-level lambda passed to a parallel_for_each may be stored in constant memory. The differences between CUDA and C++ AMP represent conscious design choices for C++ AMP to simplify the programming model. Some elements of CUDA reflect specifics of current GPU architectures that are not necessarily present in other forms of accelerators or may be significantly less common in the future. C++ AMP chose to leave these as implementation details rather than part of the model.
In this section we have discussed the features of C++ AMP that support a discrete accelerator that does not share memory with the host and runs concurrently with host computations. The key features are the array data container, explicit copy operations, and explicit asynchronous work mechanisms. We also indicated when and where such copies are made when the more flexible array_view is used when targeting discrete accelerators. We discussed the relationship of CUDA memory types to those of C++ AMP.
A computer system may include multiple accelerators suitable for implementing C++ AMP data-parallel computations. This includes both specialized hardware accelerators such as GPUs and simply the use of multicore CPUs with SIMD instructions. A system may also have multiple GPUs that may or may not have similar hardware characteristics. C++ AMP has mechanisms to enumerate available accelerators and to manage how work is mapped to those accelerators.
The class accelerator is the C++ AMP abstraction for a specific mechanism for implementing data parallelism. As shown in Figure 18.7, the accelerator::get_all static method returns a vector of the accelerators available in the system. A few properties associated with each accelerator may be used to select one when there are special requirements. For example, support for double-precision data types is an optional feature. For compute-intensive applications, it may be desirable to avoid placing work on the GPU that is used to drive an interactive display. Other properties include the amount of memory dedicated to the accelerator (accelerator::dedicated_memory) and a std::wstring that uniquely identifies the device (accelerator::device_path). The example uses the STL std::find algorithm to capture this search.
图 18.7寻找加速器的示例。
Figure 18.7 Example of finding an accelerator.
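Since the listing of Figure 18.7 is not reproduced here, the following is a minimal sketch of such a search. The predicate, the helper name pick_accelerator, and the use of std::find_if rather than std::find are illustrative assumptions, not the book's exact code.

#include <amp.h>
#include <algorithm>
#include <vector>
using namespace concurrency;

// Sketch: pick an accelerator that supports double precision and is not
// driving a display; fall back to the default accelerator otherwise.
accelerator pick_accelerator() {
  std::vector<accelerator> all = accelerator::get_all();
  auto it = std::find_if(all.begin(), all.end(),
    [](const accelerator& a) {
      return a.supports_double_precision && !a.has_display;
    });
  return (it != all.end()) ? *it : accelerator();
}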
In addition to finding a specific accelerator, a system may support multiple suitable accelerators. C++ AMP enables offloading work from one or more host threads to multiple accelerators. All such accelerator instances are returned by the call to accelerator::get_all, and they may be used concurrently by an application.
In C++ AMP, an accelerator_view is an object that refers to a specific underlying accelerator and can be used to specify that accelerator for the purpose of indicating where an array is allocated and where the work for a particular parallel_for_each should be executed. Similar to a CUDA stream (cudaStream_t), the operations performed against a particular accelerator_view are performed in order, but operations on different accelerator_views have no defined order.
In C++ AMP there is a default accelerator that is automatically selected by the runtime but can be explicitly set using the accelerator::set_default static method, which takes a device path string parameter. Each accelerator has a default accelerator_view (accelerator::default_view). The default view of the default accelerator is used for allocating an array when none is specified. A parallel_for_each may also have an explicit accelerator_view. Figure 18.8 is a variant of the vector add sample that makes the use of defaults explicit. It is not necessary to use explicit arrays to direct work using an accelerator_view. Even when all data is accessed with array_view objects that overlay host data, a parallel_for_each may have an explicit accelerator_view indicating where the work should be performed.
Figure 18.8 Explicit accelerator use.
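A minimal sketch of this style follows; the function name and data sizes are illustrative assumptions, while the accelerator_view overload of parallel_for_each is the mechanism described above.

#include <amp.h>
using namespace concurrency;

// Sketch: vector add with the default accelerator's default view made explicit.
void vecadd_explicit(const float* pA, const float* pB, float* pC, int n) {
  accelerator_view av = accelerator().default_view; // normally implicit
  array_view<const float, 1> A(n, pA), B(n, pB);
  array_view<float, 1> C(n, pC);
  C.discard_data();                 // output will be overwritten; skip copy-in
  parallel_for_each(av, C.extent,   // accelerator_view passed explicitly
    [=](index<1> i) restrict(amp) { C[i] = A[i] + B[i]; });
  C.synchronize();                  // copy results back to host memory
}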
Figure 18.9 is another illustration of explicit use of an accelerator_view. Here we provide a modified vector add operation that is parameterized by an accelerator_view that identifies where the work should be performed. The function determines the memory available on the accelerator, converts it from kilobytes to bytes, and uses it to determine the largest chunk size (block) for which three chunks may be stored concurrently. Line 8 then loops over the input vectors in chunks of this size. For each chunk, a computation is launched as was done in Figure 18.2, but here the accelerator is explicitly specified by the first parameter, acc, to the parallel_for_each. On line 17, we initiate an asynchronous transfer of the results back to the host data structure. The completion_future returned by this operation is moved into a vector of such results. After all operations are started, lines 19 and 20 iterate over the vector of results using C++ STL methods and wait for each one to complete by calling the get method before the function returns to the caller.
Figure 18.9 Explicit accelerator with asynchronous transfers.
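The following is a hedged sketch of this chunking pattern, not a reproduction of Figure 18.9; the names, the use of std::min, and the choice to keep the result arrays alive in a vector are illustrative assumptions.

#include <amp.h>
#include <algorithm>
#include <vector>
using namespace concurrency;

// Sketch: chunked vector add on an explicit accelerator_view with
// asynchronous copy-back of each chunk's results.
void chunked_vecadd(accelerator_view acc, const std::vector<float>& a,
                    const std::vector<float>& b, std::vector<float>& c) {
  // dedicated_memory is reported in kilobytes; size chunks so that
  // three buffers (two inputs, one output) fit at once
  size_t block = (static_cast<size_t>(acc.accelerator.dedicated_memory) * 1024)
                 / (3 * sizeof(float));
  std::vector<array<float, 1>> results;    // keeps device buffers alive
  std::vector<completion_future> pending;  // one future per copy-back
  results.reserve((a.size() + block - 1) / block);
  for (size_t start = 0; start < a.size(); start += block) {
    int n = static_cast<int>(std::min(block, a.size() - start));
    array<float, 1> dA(n, a.begin() + start, acc);
    array<float, 1> dB(n, b.begin() + start, acc);
    results.emplace_back(extent<1>(n), acc);
    array<float, 1>& dC = results.back();
    parallel_for_each(acc, dC.extent,
      [&dA, &dB, &dC](index<1> i) restrict(amp) { dC[i] = dA[i] + dB[i]; });
    pending.push_back(copy_async(dC, c.begin() + start)); // asynchronous
  }
  for (auto& f : pending) f.get(); // wait for all copy-backs to complete
}

A production version would need to consider buffer lifetimes and staging more carefully; the point of the sketch is the shape of the chunk loop, the explicit acc argument, and the vector of completion_futures that is drained at the end.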
This section touches on a topic important for some scenarios. We discuss a “tiled” version of data parallelism and the additional tools that model provides for optimizing memory use.
As described earlier, a data-parallel computation has an associated computational domain defined by a C++ AMP extent object. A computational domain of rank 3 or less may also be blocked into regular, rectangular subdomains called tiles. The widths of these tiles must be compile-time constants. The threads that are associated with the same tile may share variables and participate in barrier synchronization. In CUDA, the term block is used to describe these groups of threads. A new storage class, tile_static, is also added to C++ AMP to indicate a variable that has a single instance per tile, shared by all threads in the tile (in CUDA this is indicated with the __shared__ keyword). Chapter 5 discusses the motivation for using tiling and tile-shared variables to optimize memory bandwidth. Objects with this storage class may only be accessed in restrict(amp) code.
We illustrate tiling as was done in Chapter 5 by using matrix multiplication. Figure 5.12 shows the CUDA kernel, which we expand here into a host function (Figure 18.10) that contains the kernel and assumes host pointers are used to refer to dense arrays, following the interface from Chapter 5. As before, we overlay array_view objects on top of the host data and discard the output data that is about to be overwritten so it is not copied to the accelerator.
A tiled_extent is a form of extent that captures tile dimensions as template parameters. C++ AMP only supports tiling for one, two, and three dimensions, and the rank of a tiled_extent object is inferred from the number of tile dimensions specified. In this case, the tiled_extent has rank 2 (line 6).
The parallel_for_each method has an overload for tiled_extent. The structure is the same as before and the lambda function will be invoked once for each element in the compute domain. C++ AMP requires that the extent of the compute domain be evenly divisible by the tile size. In this example, Width must be a multiple of TILE_WIDTH. When this condition is not met, a runtime exception is thrown.
In the case of a parallel_for_each for a tiled_extent, the parameter to the lambda must be a tiled_index instead of an index. The tiled_index is a class template where again the tile sizes are captured as template parameters. The tiled_index (t_idx in Figure 18.10) provides both a mapping for each thread into the compute domain (t_idx.global) as well as the relative position of a thread within its tile (t_idx.local).
Line 9 declares a tile_static array named Mds that is shared by all threads in a tile. It will hold a copy of the values in M that are needed to perform a sub-block matrix multiplication computation for all of the threads in the tile. Similarly, line 10 declares analogous Nds to hold sub-blocks of N.
As in Figure 5.12, the loop on line 14 of Figure 18.10 multiplies a block row times a block column in tile-size chunks. The variable Width is used uniformly by all threads and is automatically captured from the containing function scope for use in the lambda. The threads in the tile cooperatively copy blocks of M and N into tile_static storage. Line 17 is the barrier synchronization point where all threads in the tile wait for the stores into shared variables to complete. A second barrier on line 20 makes sure all of the reads from shared variables are completed before writes in the next iteration begin. In C++ AMP, an object of type tiled_index includes a tile_barrier object as a data member, and that object provides methods to perform barriers. C++ AMP provides different forms of barriers that indicate whether the barrier applies to just tile_static data, global data, or both. Here we only need to protect tile_static data and so could use wait_with_tile_static_memory_fence, but we chose to use the wait method to match the source from Chapter 5.
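Since Figure 18.10 is not reproduced here, the following sketch reconstructs a tiled multiplication with the structure just described; TILE_WIDTH, the row-major layout, and the parameter names are assumptions, and Width must be a multiple of TILE_WIDTH as noted above.

#include <amp.h>
using namespace concurrency;
#define TILE_WIDTH 16

// Sketch: tiled P = M * N for Width x Width matrices in host memory.
void matmul_tiled(const float* Mh, const float* Nh, float* Ph, int Width) {
  array_view<const float, 2> M(Width, Width, Mh), N(Width, Width, Nh);
  array_view<float, 2> P(Width, Width, Ph);
  P.discard_data(); // output is overwritten; do not copy it to the device
  parallel_for_each(P.extent.tile<TILE_WIDTH, TILE_WIDTH>(),
    [=](tiled_index<TILE_WIDTH, TILE_WIDTH> t_idx) restrict(amp) {
      tile_static float Mds[TILE_WIDTH][TILE_WIDTH];
      tile_static float Nds[TILE_WIDTH][TILE_WIDTH];
      int row = t_idx.global[0], col = t_idx.global[1];
      int ty  = t_idx.local[0],  tx  = t_idx.local[1];
      float value = 0.0f;
      for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // cooperatively load one tile of M and one tile of N
        Mds[ty][tx] = M(row, m * TILE_WIDTH + tx);
        Nds[ty][tx] = N(m * TILE_WIDTH + ty, col);
        t_idx.barrier.wait(); // stores complete before any thread reads
        for (int k = 0; k < TILE_WIDTH; ++k)
          value += Mds[ty][k] * Nds[k][tx];
        t_idx.barrier.wait(); // reads complete before the next loads
      }
      P[t_idx.global] = value;
    });
  P.synchronize();
}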
Figure 18.11 illustrates some details of C++ AMP tiling. It shows a 20×20 compute domain as a grid of small squares, the variable e in the code fragment. Rows (dimension 0) are numbered from top to bottom and columns (dimension 1) from left to right. This domain might be blocked into 8×8 tiles. These tiles are illustrated with the larger black squares and the variable te, or alternately the variable te2, which shows the extent::tile method template for creating a tiled_extent. We also illustrate the use of the C++11 auto keyword to infer the types of variables from their initializers.
Figure 18.11 Illustration of tiling a 20×20 compute domain.
Note that the tile size in this example does not evenly divide the dimensions of the compute domain. A tiled parallel_for_each requires the extent to be a multiple of the tile size in each dimension, and the developer must explicitly handle the boundary cases when this is not so. The tiled_extent class template provides methods to either pad or truncate the underlying extent. In the example, the variable pte corresponds to the padded extent, extent<2>(24,24), while the variable tte corresponds to the truncated extent, extent<2>(16,16).
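A short sketch of these forms, reusing the 20×20 domain of Figure 18.11 (the exact declarations are assumed):

extent<2> e(20, 20);        // the compute domain of Figure 18.11
auto te  = e.tile<8, 8>();  // tiled_extent<8,8>; 20 is not a multiple of 8
auto pte = te.pad();        // padded up to extent<2>(24,24)
auto tte = te.truncate();   // truncated down to extent<2>(16,16)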
The tiled_index parameter supports a variety of members to facilitate tiled computations. The global member is an index<2> holding the position in the underlying compute domain. The solid square in the figure corresponds to position (9,6) in the compute domain. The set of tiles (large squares) forms a domain, extent<2>(3,3) in this case, which is returned by the tile_extent member. The tile member is an index<2> holding the position of a point projected into this domain. The highlighted point (9,6) is in tile (1,0). The single lightly shaded square at the left edge is the first element in each dimension of the same tile as point (9,6). This is available as tile_origin and in this example corresponds to the global index (8,0). Finally, the points within a tile can be thought of as a small domain, and the local member returns the position in this space, (1,6), formed essentially by subtracting tile_origin from global.
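Continuing the sketch above, these members can be read inside a tiled kernel; the values in the comments are those of the highlighted point in Figure 18.11.

parallel_for_each(e.tile<8, 8>().pad(),
  [=](tiled_index<8, 8> t_idx) restrict(amp) {
    // t_idx.global      == index<2>(9,6): position in the compute domain
    // t_idx.tile        == index<2>(1,0): which tile the point is in
    // t_idx.tile_origin == index<2>(8,0): first element of that tile
    // t_idx.local       == index<2>(1,6): global - tile_origin
  });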
The primary motivation for C++ AMP is to support data parallelism as an important algorithmic pattern for general computing. Rendering and image processing are very important mainstream workloads for which C++ AMP includes some more specialized support, discussed briefly in this section. These facilities include normalized floating-point types, short vector types, textures, and, optionally on Microsoft platforms, interoperation with DirectX. Many of these features are segregated into a separate namespace, concurrency::graphics. Figure 18.12 illustrates some of the types defined in that namespace and discussed in this section.
Figure 18.12 Example of types from concurrency::graphics.
C++ AMP provides two types, norm and unorm, which provide arithmetic that is floating point in nature but of bounded range. The norm type holds signed values with magnitude no more than one, while the unorm type holds non-negative values with magnitude no more than one. Common arithmetic operations are defined on these types, where result values that would exceed the range are forced to the extreme value (“clamped”). These types may be mixed with C++ types and convert to float. They may also be used as element types for the C++ AMP composite types array and array_view and for the texture objects described below.
Graphics programs frequently manipulate short vectors of primitive types. C++ AMP supports graphics programming by including definitions of these. For the C++ AMP types int, unsigned int (as uint), float, double, norm, and unorm, and for each vector length 2, 3, and 4, there exist types such as int_2, uint_3, and float_4. Each of these holds a number of component values that are accessed by name. The names supported are x, y, z, and w, or alternately r, g, b, and a. Thus, given the declarations in Figure 18.12, we might access a component f4.z, which is a single float that can be used as either an rvalue or an lvalue. Certain compound patterns are also supported, such as f4.xy, which corresponds to a short vector of suitable length, float_2 in this case, that may be used as either an rvalue or an lvalue. Assignment and arithmetic on short vectors are done component-wise, with scalar arguments promoted to vectors holding that value in each component.
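A brief sketch of these accessors (the declarations are assumed, in the spirit of Figure 18.12):

#include <amp_short_vectors.h>
using namespace concurrency::graphics;

void short_vector_demo() {
  float_4 f4(1.0f, 2.0f, 3.0f, 4.0f);
  float   z  = f4.z;             // single component as an rvalue
  float_2 xy = f4.xy;            // compound ("swizzle") access yields a float_2
  f4.xy = float_2(0.0f, 0.0f);   // compound access as an lvalue
  float_4 g4 = f4 + float_4(1.0f, 1.0f, 1.0f, 1.0f); // component-wise add
}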
A texture is a special form of array that allows data-parallel code to access values that are stored using reduced precision. This is a common representation for image data and is the only method in the first version of C++ AMP to access partial word data types in a restrict(amp) context. Like an array, a texture is a class template that is parameterized by an element type and a rank. The set of allowed element types is constrained to be a subset of the restrict(amp) compatible primitive types and their short vector variants.
When a texture is constructed, in addition to the extent and a data source, a final unsigned integer argument indicates the number of bits per primitive data value used to store each value. Line 15 shows an example texture holding four-wide vectors of unsigned normalized floating-point values. The 16U passed to the constructor indicates that each of these values is stored with only 16 bits of information. Not all combinations of data type, vector length, and storage width are supported (the details are listed in the C++ AMP open specification, http://blogs.msdn.com/b/nativeconcurrency/archive/2012/02/03/c-amp-open-spec-published.aspx).
A texture is a storage container like an array and may be associated with a particular accelerator_view. A texture is also indexed like an array, with overloads of the index operator taking an index instance of suitable rank as a parameter. As with array, these operations are restrict(amp) and may not be used in host code. Overloads of the function template copy support transfers to and from host data structures.
A subset of textures may be written to directly, and this is done explicitly via a texture::set method. For texture formats for which writing is not directly supported by hardware accelerators, C++ AMP provides the writeonly_texture_view class template, illustrated with the variable named wotv (line 16 of Figure 18.12). The set method on this object may be used in a restrict(amp) context to define values in a texture.
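A sketch of such a texture declaration together with a writeonly_texture_view follows; the extent, the names, and the kernel are assumptions, in the spirit of lines 15 and 16 of Figure 18.12.

#include <amp.h>
#include <amp_graphics.h>
using namespace concurrency;
using namespace concurrency::graphics;

void texture_demo() {
  // 2D texture of four-wide unorm vectors; each scalar stored in 16 bits
  texture<unorm_4, 2> tex(480, 640, 16U);

  // view for writing to formats the hardware cannot write directly
  writeonly_texture_view<unorm_4, 2> wotv(tex);
  parallel_for_each(tex.extent, [=](index<2> idx) restrict(amp) {
    wotv.set(idx, unorm_4(0.0f, 0.0f, 0.0f, 1.0f)); // define each texel
  });
}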
Beyond support for these types, C++ AMP on Microsoft platforms includes specific features to enable interoperation with the DirectX framework. These interfaces are available in two namespaces: concurrency::direct3d contains make_array, get_buffer, and create_accelerator_view, while concurrency::graphics::direct3d contains make_texture. They include the following capabilities:
• Treating an existing Direct 3D device interface pointer as a C++ AMP accelerator_view.
• Treating an existing Direct 3D buffer interface pointer as a C++ AMP array.
• Treating an existing Direct 3D texture interface pointer as a C++ AMP texture.
These capabilities allow C++ AMP to provide a C++ language solution for GPU compute scenarios that integrates smoothly with the DirectX rendering framework.
Figure 18.13 illustrates the interop features. Function my_rotate consumes a vector of vertex data that is located on the host. Parameter d3ddevice is the existing DirectX interface that is used to first construct an accelerator_view and then an array. The parallel_for_each performs a rotation of the vertices where the result is left on the accelerator. Since the array instance vertices is located on a particular accelerator_view, the parallel_for_each will be executed on that same accelerator_view. We extract the underlying buffer object (typed only as IUnknown) and return this to the caller for subsequent use in scene rendering.
Figure 18.13 Example DirectX interop—rotate vertex list.
This chapter has presented an overview of C++ AMP, a small extension to C++11 to support hardware acceleration of data-parallel computations. The discussion is not complete, but the full specification is available at http://blogs.msdn.com/b/nativeconcurrency/archive/2012/02/03/c-amp-open-spec-published.aspx. The focus of C++ AMP is to create features that integrate well into modern C++ and leverage features such as templates, lambdas, and futures to provide a highly productive set of abstractions that compose with other aspects of C++ and parallelism. The features are layered to allow use by a very broad set of developers with limited knowledge of computer architecture, while also providing access to the rich execution model needed for the most performance-critical scenarios. Lowering the barrier to expressing data parallelism and ensuring portability across hardware platforms will help more applications deliver the benefits of hardware acceleration and heterogeneous computing.
18.1. Translate the simple, untiled version of matrix multiplication into C++ AMP. The CUDA kernel is shown in Figure 4.7. Write a host function that applies this computation to three array_view<float,2> inputs. Rather than implementing C = A∗B, accumulate in the output and implement C += A∗B.
18.2. Given an array view of rank 2, X, index<2> ij, and extent<2> e, the operation X.section(ij,e) returns a new array_view that overlays the same data as X. If we denote this new view as S, then for all valid indices idx of S we have S[idx] is the same location as X[idx+ij].
Assume now there are three array_view<float,2> objects, A, B, and C. Assume they will not fit simultaneously in the dedicated_memory of the accelerator in the system. Use the array_view::section method, explicit array objects, and the matrix multiply building block from the first exercise to implement matrix multiplication for the large arrays.
18.3. Assume std::vector gpu holds two elements of type accelerator_view that refer to different but similar GPUs in a system. Modify the solution to Exercise 18.2 to use both accelerators to implement the work.
18.4. Translate the tiled version of matrix transpose from Exercise 4.2 into C++ AMP.
18.5. The inner loop in Figure 18.3 redundantly loads data through atom_view that is used in multiple threads and these references are not coalesced (see Section 6.2). Rewrite the function in Figure 18.3 to use tile_static memory to improve the memory efficiency for accessing the data in atom_view.
19.1 Background
19.2 A Running Example
19.3 MPI Basics
19.4 MPI Point-to-Point Communication Types
19.5 Overlapping Computation and Communication
19.6 MPI Collective Communication
19.7 Summary
19.8 Exercises
So far we have focused on programming a heterogeneous computing system with one host and one device. In high-performance computing (HPC), many applications require the aggregate computing power of a cluster of computing nodes. Many of the HPC clusters today have one or more hosts and one or more devices in each node. Historically, these clusters have been programmed predominantly with the Message Passing Interface (MPI). In this chapter, we will present an introduction to joint MPI/CUDA programming. Readers should be able to easily extend the material to joint MPI/OpenCL, MPI/OpenACC, and so on. We will only present the key MPI concepts that programmers need to understand to scale their heterogeneous applications to multiple nodes in a cluster environment. In particular, we will focus on domain partitioning, point-to-point communication, and collective communication in the context of scaling a CUDA kernel into multiple nodes.
While practically no top supercomputers used GPUs before 2009, the need for better energy efficiency has led to fast adoption of GPUs in recent years. Many of the top supercomputers in the world today use both CPUs and GPUs in each node. The effectiveness of this approach is validated by their high rankings on the Green 500 list, which reflect their high energy efficiency.
The dominating programming interface for computing clusters today is MPI [Gropp1999], which is a set of API functions for communication between processes running in a computing cluster. MPI assumes a distributed memory model where processes exchange information by sending messages to each other. When an application uses API communication functions, it does not need to deal with the details of the interconnect network. The MPI implementation allows the processes to address each other using logical numbers, much the same way as using phone numbers in a telephone system: telephone users can dial each other using phone numbers without knowing exactly where the called person is and how the call is routed.
In a typical MPI application, data and work are partitioned among processes. As shown in Figure 19.1, each node can contain one or more processes, shown as clouds within nodes. As these processes progress, they may need data from each other. This need is satisfied by sending and receiving messages. In some cases, the processes also need to synchronize with each other and generate collective results when collaborating on a large task. This is done with collective communication API functions.
Figure 19.1 Programmer’s view of MPI processes.
We will use a 3D stencil computation as a running example. We assume that the computation calculates heat transfer based on a finite difference method for solving a partial differential equation that describes the physical laws of heat transfer. In each step, the value of a grid point is calculated as a weighted sum of neighbors (north, east, south, west, up, down) and its own value from the previous time step. To achieve high numerical stability, multiple indirect neighbors in each direction are also used in the computation of a grid point. This is referred to as a higher-order stencil computation. For the purpose of this chapter, we assume that four points in each direction will be used. As shown in Figure 19.2, there are a total of 24 neighbor points for calculating the next step value of a grid point, and each point in the grid has an x, y, and z coordinate. For a grid point whose coordinate value is x=i, y=j, and z=k, or (i,j,k), its 24 neighbors are (i−4,j,k), (i−3,j,k), (i−2,j,k), (i−1,j,k), (i+1,j,k), (i+2,j,k), (i+3,j,k), (i+4,j,k), (i,j−4,k), (i,j−3,k), (i,j−2,k), (i,j−1,k), (i,j+1,k), (i,j+2,k), (i,j+3,k), (i,j+4,k), (i,j,k−4), (i,j,k−3), (i,j,k−2), (i,j,k−1), (i,j,k+1), (i,j,k+2), (i,j,k+3), and (i,j,k+4). Since the next data value of each grid point is calculated based on the current data values of 25 points (24 neighbors and itself), this type of computation is often called a 25-stencil computation.
Figure 19.2 A 25-stencil computation example, where the neighbors are in the x, y, and z directions.
We assume that the system is modeled as a structured grid, where spacing between grid points is constant within each direction. This allows us to use a 3D array where each element stores the state of a grid point. The physical distance between adjacent elements in each dimension can be represented by a spacing variable. Figure 19.3 illustrates a 3D array that represents a rectangular ventilation duct, with x and y dimensions as the cross-sections of the duct and the z dimension the direction of the heat flow along the duct.
Figure 19.3 3D grid array for modeling the heat transfer in a duct.
We assume that the data is laid out in the memory space and that x is the lowest dimension, y is the next, and z is the highest. That is, all elements with y=0 and z=0 will be placed in consecutive memory locations according to their x coordinate. Figure 19.4 shows a small example of the grid data layout. This small example has only 16 data elements in the grid: two elements in the x dimension, two in the y dimension, and four in the z dimension. Both x elements with y=0 and z=0 are placed in memory first. They are followed by all elements with y=1 and z=0. The next group will be elements with y=0 and z=1.
Figure 19.4 A small example of memory layout for the 3D grid.
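In code, this layout corresponds to the usual linearization of a 3D index; a small sketch follows (the function and variable names are illustrative):

// Linear offset of grid point (x, y, z), with x the lowest dimension,
// y the next, and z the highest.
float grid_at(const float* grid, int dimx, int dimy, int x, int y, int z) {
  return grid[x + dimx * (y + dimy * z)];
}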
When one uses a cluster, it is common to divide the input data into several partitions, called domain partitions, and assign each partition to a node in the cluster. In Figure 19.3, we show that the 3D array is divided into four domain partitions: D1, D2, D3, and D4. Each of the partitions will be assigned to an MPI compute process.
The domain partition can be further illustrated with Figure 19.4. The first section, or slice, of four elements (z=0) in Figure 19.4 is in the first partition, the second section (z=1) the second partition, the third section (z=2) the third partition, and the fourth section (z=3) the fourth partition. This is obviously a toy example. In a real application, there are typically hundreds or even thousands of elements in each dimension. For the rest of this chapter, it is useful to remember that all elements in a z slice are in consecutive memory locations.
Like CUDA, MPI programs are based on the SPMD (single program, multiple data) parallel execution model. All MPI processes execute the same program. The MPI system provides a set of API functions to establish communication systems that allow the processes to communicate with each other. Figure 19.5 shows five essential API functions that set up and tear down communication systems for an MPI application. Figure 19.6 shows a simple MPI program that uses these API functions. A user needs to supply the executable file of the program to the mpirun command or the mpiexec command in a cluster.
Figure 19.5 Five basic MPI API functions for establishing and closing a communication system.
Figure 19.6 A simple MPI main program.
Each process starts by initializing the MPI runtime with an MPI_Init() call. This initializes the communication system for all the processes running the application. Once the MPI runtime is initialized, each process calls two functions to prepare for communication. The first function is MPI_Comm_rank(), which returns a unique number for each calling process, called the MPI rank or process ID. The numbers received by the processes range from 0 to the number of processes minus 1. The MPI rank of a process is equivalent to the expression blockIdx.x∗blockDim.x+threadIdx.x for a CUDA thread. It uniquely identifies the process in a communication, similar to the phone number in a telephone system.
MPI_Comm_rank() takes two parameters. The first one is an MPI built-in type, MPI_Comm, that specifies the scope of the request. Values of MPI_Comm are commonly referred to as communicators. MPI_Comm and other MPI built-in types are defined in the mpi.h header file, which should be included in all C program files that use MPI. This is similar to the cuda.h header file for CUDA programs. An MPI application can create one or more intracommunicators. Members of each intracommunicator are MPI processes. MPI_Comm_rank() assigns a unique ID to each process in an intracommunicator. In Figure 19.6, the parameter value passed is MPI_COMM_WORLD, which means that the intracommunicator includes all MPI processes running the application.
The second parameter to the MPI_Comm_rank() function is a pointer to an integer variable into which the function will deposit the returned rank value. In Figure 19.6, a variable pid is declared for this purpose. After the MPI_Comm_rank() returns, the pid variable will contain the unique ID for the calling process.
The second API function is MPI_Comm_size(), which returns the total number of MPI processes running in the intracommunicator. The MPI_Comm_size() function takes two parameters. The first one is an MPI built-in type MPI_Comm that gives the scope of the request. In Figure 19.6, the scope is MPI_COMM_WORLD. Since we use MPI_COMM_WORLD, the returned value is the number of MPI processes running the application. This is requested by a user when the application is submitted using the mpirun command or the mpiexec command. However, the user may not have requested a sufficient number of processes. Also, the system may or may not be able to create all the processes requested. Therefore, it is good practice for an MPI application program to check the actual number of processes running.
The second parameter is a pointer to an integer variable into which the MPI_Comm_size() function will deposit the return value. In Figure 19.6, a variable np is declared for this purpose. After the function returns, the variable np contains the number of MPI processes running the application. In Figure 19.6, we assume that the application requires at least three MPI processes. Therefore, it checks if the number of processes is at least three. If not, it calls the MPI_Abort() function to terminate the communication connections and return with an error flag value of 1.
Figure 19.6 also shows a common pattern for reporting errors or other chores. There are multiple MPI processes but we need to report the error only once. The application code designates the process with pid=0 to do the reporting.
As shown in Figure 19.5, the MPI_Abort() function takes two parameters. The first is the scope of the request. In Figure 19.6, the scope is all MPI processes running the application. The second parameter is a code for the type of error that caused the abort. Any number other than 0 indicates that an error has happened.
If the number of processes satisfies the requirement, the application program goes on to perform the calculation. In Figure 19.6, the application uses np-1 processes (pid from 0 to np-2) to perform the calculation and one process (the last one, whose pid is np-1) to perform an input/output (I/O) service for the others. We will refer to the process that performs the I/O service as the data server and the processes that perform the calculation as compute processes. In Figure 19.6, if the pid of a process is within the range 0 to np-2, it is a compute process and calls the compute_process() function. If the process pid is np-1, it is the data server and calls the data_server() function.
After the application completes its computation, it notifies the MPI runtime with a call to the MPI_Finalize(), which frees all MPI communication resources allocated to the application. The application can then exit with a return value 0, which indicates that no error occurred.
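Figure 19.6 itself is not reproduced here; the following sketch shows the shape of such a main program based on the description above. The grid dimensions and the parameters passed to compute_process() are illustrative assumptions; data_server() and compute_process() are the functions described in the following sections.

#include <mpi.h>
#include <stdio.h>

void data_server(int dimx, int dimy, int dimz, int nreps);     // Figs. 19.9-19.10
void compute_process(int dimx, int dimy, int dimz, int nreps); // Figs. 19.11-19.12

int main(int argc, char* argv[]) {
  int pid = -1, np = -1;
  MPI_Init(&argc, &argv);               // set up the MPI runtime
  MPI_Comm_rank(MPI_COMM_WORLD, &pid);  // unique rank of this process
  MPI_Comm_size(MPI_COMM_WORLD, &np);   // total number of processes
  if (np < 3) {
    if (pid == 0) printf("Need 3 or more processes.\n"); // report once
    MPI_Abort(MPI_COMM_WORLD, 1);       // terminate with error flag 1
    return 1;
  }
  int dimx = 480, dimy = 480, dimz = 400, nreps = 100;   // illustrative sizes
  if (pid < np - 1)
    compute_process(dimx, dimy, dimz / (np - 1), nreps); // compute processes
  else
    data_server(dimx, dimy, dimz, nreps);                // last rank: data server
  MPI_Finalize();                       // free MPI communication resources
  return 0;
}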
MPI supports two major types of communication. The first is the point-to-point type, which involves one source process and one destination process. The source process calls the MPI_Send() function and the destination process calls the MPI_Recv() function. This is analogous to a caller dialing a call and a receiver answering a call in a telephone system.
Figure 19.7 shows the syntax for using the MPI_Send() function. The first parameter is a pointer to the starting location of the memory area where the data to be sent can be found. The second parameter is an integer that gives the number of data elements to be sent. The third parameter is of the MPI built-in type MPI_Datatype. It specifies the type of each data element being sent. MPI_Datatype is defined in mpi.h and includes MPI_DOUBLE (double precision, floating point), MPI_FLOAT (single precision, floating point), MPI_INT (integer), and MPI_CHAR (character). The exact sizes of these types depend on the size of the corresponding C types in the host processor. See the MPI programming guide for more sophisticated uses of MPI types [Gropp 1999].
Figure 19.7 Syntax for the MPI_Send() function.
The fourth parameter for MPI_Send is an integer that gives the MPI rank of the destination process. The fifth parameter gives a tag that can be used to classify the messages sent by the same process. The sixth parameter is a communicator that selects the processes to be considered in the communication.
Figure 19.8 shows the syntax for using the MPI_Recv() function. The first parameter is a pointer to the area in memory where the received data should be deposited. The second parameter is an integer that gives the maximal number of elements that the MPI_Recv() function is allowed to receive. The third parameter is an MPI_Datatype that specifies the type (size) of each element to be received. The fourth parameter is an integer that gives the process ID of the source of the message.
Figure 19.8 Syntax for the MPI_Recv() function.
The fifth parameter is an integer that specifies the particular tag value expected by the destination process. If the destination process does not want to be limited to a particular tag value, it can use MPI_ANY_TAG, which means that the receiver is willing to accept messages of any tag value from the source.
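As a minimal sketch of a matching pair (the buffer size, ranks, and tag are illustrative):

#include <mpi.h>

void send_recv_demo(int pid) {
  float buf[1024];
  MPI_Status status;
  if (pid == 0) {
    // source process: send 1024 floats to rank 1 with tag 0
    MPI_Send(buf, 1024, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
  } else if (pid == 1) {
    // destination process: receive up to 1024 floats from rank 0, any tag
    MPI_Recv(buf, 1024, MPI_FLOAT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
  }
}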
We will first use the data server to illustrate the use of point-to-point communication. In a real application, the data server process would typically perform data input and output operations for the compute processes. However, input and output have too much system-dependent complexity. Since I/O is not the focus of our discussion, we will avoid the complexity of I/O operations in a cluster environment. That is, instead of reading data from a file system, we will just have the data server initialize the data with random numbers and distribute the data to the compute processes. The first part of the data server code is shown in Figure 19.9.
Figure 19.9 Data server process code (part 1).
The data server function takes four parameters. The first three parameters specify the size of the 3D grid: number of elements in the x dimension is dimx, the number of elements in the y dimension is dimy, and the number of elements in the z dimension is dimz. The fourth parameter specifies the number of iterations that need to be done for all the data points in the grid.
In Figure 19.9, line 1 declares the variable np that will contain the number of processes running the application. Line 2 calls MPI_Comm_size(), which will deposit the information into np. Line 3 declares and initializes several helper variables. The variable num_comp_procs contains the number of compute processes. Since we are reserving one process as the data server, there are np-1 compute processes. The variable first_proc gives the process ID of the first compute process, which is 0. The variable last_proc gives the process ID of the last compute process, which is np-2. That is, line 3 designates the first np-1 processes, 0 through np-2, as compute processes. This reflects the design decision that the process with the largest rank serves as the data server. This decision will also be reflected in the compute process code.
Line 4 declares and initializes the num_points variable that gives the total number of grid data points to be processed, which is simply the product of the number of elements in each dimension, or dimx ∗ dimy ∗ dimz. Line 5 declares and initializes the num_bytes variable that gives the total number of bytes needed to store all the grid data points. Since each grid data point is a float, this value is num_points ∗ sizeof(float).
Line 6 declares two pointer variables: input and output. These two pointers will point to the input data buffer and the output data buffer, respectively. Lines 7 and 8 allocate memory for the input and output buffers and assign their addresses to their respective pointers. Line 9 checks if the memory allocations were successful. If either of the memory allocations fail, the corresponding pointer will receive a NULL pointer from the malloc() function. In this case, the code aborts the application and reports an error.
Lines 11 and 12 calculate the number of grid point array elements that should be sent to each compute process. As shown in Figure 19.3, there are two types of compute processes. The first process (process 0) and the last process (process 3) compute an “edge” partition that has neighbors only on one side. Partition 0 assigned to the first process has a neighbor only on the right side (partition 1). Partition 3 assigned to the last process has a neighbor only on the left side (partition 2). We call the compute processes that compute edge partitions the edge processes.
Each of the remaining processes computes an internal partition that has neighbors on both sides. For example, the second process (process 1) computes a partition (partition 1) that has a left neighbor (partition 0) and a right neighbor (partition 2). We call the processes that compute internal partitions internal processes.
Recall that each calculation step for a grid point needs the values of its immediate neighbors from the previous step. This creates a need for halo cells for grid points at the left and right boundaries of a partition, shown as slices defined by dotted lines at the edge of each partition in Figure 19.3. Note that these halo cells are similar to those in the convolution pattern presented in Chapter 8. Therefore, each process also needs to receive one slice of halo cells that contains all neighbors for the boundary grid points of its partition. For example, in Figure 19.3, partition D2 needs a halo slice from D1 and a halo slice from D3. Note that a halo slice for D2 is a boundary slice for D1 or D3.
Recall that the total number of grid points is dimx∗dimy∗dimz. Since we are partitioning the grid along the z dimension, the number of grid points in each partition should be dimx∗dimy∗(dimz/num_comp_procs). Recall that we will need four neighbor slices in each direction to calculate values within each partition. Because we need to send four slices of grid points for each neighbor, the number of grid points that should be sent to each internal process should be dimx∗dimy∗(dimz/num_comp_procs + 8). As for an edge process, there is only one neighbor. As in the case of convolution, we assume that zero values will be used for the ghost cells and no input data needs to be sent for them. For example, partition D1 only needs the neighbor slice from D2 on the right side. Therefore, the number of grid points to be sent to an edge process should be dimx∗dimy∗(dimz/num_comp_procs + 4). That is, each process receives four slices of halo grid points from the neighbor partition on each side.
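To make the arithmetic concrete, a small sketch with illustrative sizes (these numbers are not from the text):

void sizes_demo() {
  int dimx = 480, dimy = 480, dimz = 400, num_comp_procs = 4;
  // each partition: 480 * 480 * 100 = 23,040,000 grid points
  int edge_num_points = dimx * dimy * (dimz / num_comp_procs + 4); // 23,961,600
  int int_num_points  = dimx * dimy * (dimz / num_comp_procs + 8); // 24,883,200
}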
Line 13 of Figure 19.9 sets the send_address pointer to point to the beginning of the input grid point array. To send the appropriate partition to each process, we need to add the appropriate offset to this beginning address for each MPI_Send(). We will come back to this point later.
We are now ready to complete the code for the data server, shown in Figure 19.10. Line 14 sends process 0 its partition. Since this is the first partition, its starting address is also the starting address of the entire grid, which was set up in line 13. Process 0 is an edge process and it does not have a left neighbor. Therefore, the number of grid points to be sent is the value edge_num_points, that is, dimx∗dimy∗(dimz/num_comp_procs +4). The third parameter specifies that the type of each element is an MPI_FLOAT, which is a C float (single precision, 4 bytes). The fourth parameter specifies that the value of first_node (i.e., 0) is the MPI rank of the destination process. The fifth parameter specifies 0 for the MPI tag. This is because we are not using tags to distinguish between messages sent from the data server. The sixth parameter specifies that the communicator to be used for sending the message should be all MPI processes for the current application.
Figure 19.10 Data server process code (part 2).
Line 15 of Figure 19.10 advances the send_address pointer to the beginning of the data to be sent to process 1. From Figure 19.3, there are dimx∗dimy∗(dimz/num_comp_procs) elements in partition D1, which means D2 starts at a location that is dimx∗dimy∗(dimz/num_comp_procs) elements from the starting location of input. Recall that we also need to send the halo cells from D1 as well. Therefore, we adjust the starting address for the MPI_Send() back by four slices, which results in the expression for advancing the send_address pointer in line 15: dimx∗dimy∗(dimz/num_comp_procs-4).
Line 16 is a loop that sends out the MPI messages to process 1 through process np-3. In our small example for four compute processes, np is 5. The loop sends the MPI messages to processes 1, 2, and 3. These are internal processes that need to receive halo grid points for neighbors on both sides. Therefore, the second parameter of the MPI_Send() in line 17 uses int_num_nodes, that is, dimx∗dimy∗(dimz/num_comp_procs+8). The rest of the parameters are similar to that for the MPI_Send() in line 14 with the obvious exception that the destination process is specified by the loop variable process, which is incremented from 1 to np-3 (last_node is np-2).
Line 18 advances the send address for each internal process by the number of grid points in each partition: dimx∗dimy∗dimz/num_comp_nodes. Note that the starting locations of the halo grid points for internal processes are dimx∗dimy∗dimz/num_comp_procs points apart. Although we need to pull back the starting address by four slices to accommodate halo grid points, we do so for every internal process so the net distance between the starting locations remains as the number of grid points in each partition.
Line 19 sends the data to the process np-2, the last compute process that has only one neighbor to the left. Readers should be able to reason through all the parameter values used. Note that we are not quite done with the data server code. We will come back later for the final part of the data server that collects the output values from all compute processes.
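Putting lines 13-19 together, the distribution logic has the following shape. This is a sketch following the description above, not the exact code of Figure 19.10; the variable names follow the text, and the line-number comments map to the figure.

#include <mpi.h>

void distribute(float *input, int dimx, int dimy, int dimz,
                int num_comp_procs, int first_node, int last_node,
                int edge_num_points, int int_num_points) {
  float *send_address = input;                                // line 13
  MPI_Send(send_address, edge_num_points, MPI_FLOAT,
           first_node, 0, MPI_COMM_WORLD);                    // line 14
  send_address += dimx * dimy * (dimz / num_comp_procs - 4);  // line 15
  for (int process = first_node + 1; process < last_node; process++) { // line 16
    MPI_Send(send_address, int_num_points, MPI_FLOAT,
             process, 0, MPI_COMM_WORLD);                     // line 17
    send_address += dimx * dimy * (dimz / num_comp_procs);    // line 18
  }
  MPI_Send(send_address, edge_num_points, MPI_FLOAT,
           last_node, 0, MPI_COMM_WORLD);                     // line 19
}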
We now turn our attention to the compute processes that receive the input from the data server process. In Figure 19.11, lines 1 and 2 establish the process ID for the process and the total number of processes for the application. Line 3 establishes that the data server is process np-1. Lines 4 and 5 calculate the number of grid points and the number of bytes that should be processed by each internal process. Lines 6 and 7 calculate the number of grid points and the number of bytes in each halo (four slices).
Figure 19.11 Compute process code (part 1).
Lines 8-10 allocate the host memory and device memory for the input data. Note that although the edge processes need less halo data, they still allocate the same amount of memory for simplicity; part of the allocated memory will not be used by the edge processes. Line 10 sets the starting address of the host memory for receiving the input data from the data server. For all compute processes except process 0, the starting receiving location is simply the starting location of the allocated memory for the input data. However, we adjust the receiving location by four slices. This is because for simplicity, we assume that the host memory for receiving the input data is arranged the same way for all compute processes: four slices of halo from the left neighbor followed by the partition, followed by four slices of halo from the right neighbor. However, we showed in line 4 of Figure 19.10 that the data server will not send any halo data from the left neighbor to process 0. That is, for process 0, the MPI message from the data server only contains the partition and the halo from the right neighbor. Therefore, line 10 adjusts the starting host memory location by four slices so that process 0 will correctly interpret the input data from the data server.
Line 12 receives the MPI message from the data server. Most of the parameters should be familiar. The last parameter reflects any error condition that occurred when the data was received. The second parameter specifies that all compute processes will receive the full amount of data from the data server. However, the data server will send less data to process 0 and process np-2. This is not reflected in the code because MPI_Recv() allows the second parameter to specify a larger number of data points than what is actually received, and will only place the actual number of bytes received from the sender into the receiving memory. In the case of process 0, the input data from the data server contains only the partition and the halo from the right neighbor. The received input will be placed by skipping the first four slices of the allocated memory, which should correspond to the halo for the (nonexistent) left neighbor. This effect is achieved with the term num_halo_points*(pid==0) in line 11. In the case of process np-2, the input data contains the halo from the left neighbor and the partition. The received input will be placed from the beginning of the allocated memory, leaving the last four slices of the allocated memory unused.
Line 13 copies the received input data to the device memory. In the case of process 0, the left halo points are not valid. In the case of process np-2, the right halo points are not valid. However, for simplicity, all compute nodes copy the full size to the device memory. The assumption is that the kernels will be launched in such a way that these invalid portions are correctly ignored. After line 13, all the input data is in the device memory.
Figure 19.13 shows part 2 of the compute process code. Lines 14-16 allocate host memory and device memory for the output data. The output data buffer in the device memory will actually be used as a ping-pong buffer with the input data buffer. That is, they will switch roles in each iteration. We will return to this point later.

Figure 19.12 A two-stage strategy for overlapping computation with communication.
We are now ready to present the code that performs the computation steps on the grid points.

A simple way to perform the computation steps is for each compute process to perform a computation step on its entire partition, exchange halo data with the left and right neighbors, and repeat. While this is a very simple strategy, it is not very effective. The reason is that this strategy forces the system to be in one of two modes. In the first mode, all compute processes are performing computation steps. During this time, the communication network is not used. In the second mode, all compute processes exchange halo data with their left and right neighbors. During this time, the computation hardware is not well utilized. Ideally, we would like to achieve better performance by utilizing both the communication network and the computation hardware all the time. This can be achieved by dividing the computation tasks of each compute process into two stages, as illustrated in Figure 19.12.
Figure 19.13 Compute process code (part 2).
During the first stage (stage 1), each compute process calculates its boundary slices, which will be needed as halo cells by its neighbors in the next iteration. Let us continue to assume that we use four slices of halo data. Figure 19.12 shows the collection of four halo slices as a dashed transparent piece and the four boundary slices as a solid piece. Note that the solid piece of process i will be copied into the dashed piece of process i+1, and vice versa, during the next communication. For process 0, the first stage calculates the right four slices of boundary data for four computation steps. An internal node calculates both the left four slices and the right four slices of its boundary data. Process np-2 calculates the left four slices of its boundary data. The rationale is that these boundary slices are needed by the neighbors for their next iteration. By calculating these boundary slices first, the data can be communicated to the neighbors while each compute process calculates the rest of its grid points.
During the second stage (stage 2), each compute process performs two parallel activities. The first is to communicate its new boundary values to its neighbor processes. This is done by first copying the data from the device memory into the host memory, followed by sending MPI messages to the neighbors. As we will discuss later, we need to be careful that the data received from the neighbors is used in the next iteration, not the current iteration. The second activity is to calculate the rest of the data in the partition. If the communication activity takes less time than the calculation activity, we can hide the communication delay and fully utilize the computing hardware all the time. This is usually achieved by having enough slices in the internal part of each partition to allow each compute process to perform computation steps between communications.
To support the parallel activities in stage 2, we need to use two advanced features of the CUDA programming model: pinned memory allocation and streams. A pinned memory allocation requests that the memory allocated will not be paged out by the operating system. This is done with the cudaHostAlloc() API call. Lines 19-22 allocate memory buffers for the left and right boundary slices and the left and right halo slices. The left and right boundary slices need to be sent from the device memory to the left and right neighbor processes. The buffers are used as a host memory staging area for the device to copy data into, and then used as the source buffer for MPI_Send() to neighbor processes. The left and right halo slices need to be received from neighbor processes. The buffers are used as a host memory staging area for MPI_Recv() to use as a destination buffer and then copied to the device memory.

Note that the host memory allocation is done with the cudaHostAlloc() function rather than the standard malloc() function. The difference is that the cudaHostAlloc() function allocates a pinned memory buffer, sometimes also referred to as a page-locked memory buffer. We need to present a little more background on memory management in operating systems to fully understand the concept of pinned memory buffers.
In a modern computer system, the operating system manages a virtual memory space for applications. Each application has access to a large, contiguous address space. In reality, the system has a limited amount of physical memory that needs to be shared among all running applications. This sharing is performed by partitioning the virtual memory space into pages and mapping only the actively used pages into physical memory. When there is high demand for memory, the operating system may need to "page out" some of the pages from physical memory to mass storage such as disks. Therefore, an application may have its data paged out at any time during its execution.

The implementation of cudaMemcpy() uses a type of hardware called a direct memory access (DMA) device. When a cudaMemcpy() function is called to copy between the host and device memories, its implementation uses a DMA device to complete the task. On the host memory side, the DMA hardware operates on physical addresses. That is, the operating system needs to give a translated physical address to the DMA device. However, there is a chance that the data may be paged out before the DMA operation is complete. The physical memory locations for the data may be reassigned to other virtual memory data. In this case, the DMA operation can be corrupted, since its data can be overwritten by the paging activity.

A common solution to this corruption problem is for the CUDA runtime to perform the copy operation in two steps. For a host-to-device copy, the CUDA runtime first copies the source host memory data into a "pinned" memory buffer, meaning that the memory locations are marked so that the operating system's paging mechanism will not page out the data. It then uses the DMA device to copy the data from the pinned memory buffer to the device memory. For a device-to-host copy, the CUDA runtime first uses a DMA device to copy the data from the device memory into a pinned memory buffer. It then copies the data from the pinned memory to the destination host memory location. By using an extra pinned memory buffer, the DMA copy will be safe from any paging activities.
There are two problems with this approach. One is that the extra copy adds delay to the cudaMemcpy() operation. The second is that the extra complexity involved leads to a synchronous implementation of the cudaMemcpy() function. That is, the host program cannot continue to execute until the cudaMemcpy() function completes its operation and returns. This serializes all copy operations. To support fast copies with more parallelism, CUDA provides a cudaMemcpyAsync() function.

To use the cudaMemcpyAsync() function, the host memory buffer must be allocated as a pinned memory buffer. This is done in lines 19-22 for the host memory buffers of the left boundary, right boundary, left halo, and right halo slices. These buffers are allocated with a special cudaHostAlloc() function, which ensures that the allocated memory is pinned or page-locked from paging activities. Note that the cudaHostAlloc() function takes three parameters. The first two are the same as cudaMalloc(). The third specifies some options for more advanced usage. For most basic use cases, we can simply use the default value cudaHostAllocDefault.
The second advanced CUDA feature is streams, which support the managed concurrent execution of CUDA API functions. A stream is an ordered sequence of operations. When host code calls a cudaMemcpyAsync() function or launches a kernel, it can specify a stream as one of its parameters. All operations in the same stream will be completed sequentially. Operations from two different streams can be executed in parallel.
Line 23 of Figure 19.13 declares two variables of the CUDA built-in type cudaStream_t. Recall that the CUDA built-in types are declared in cuda.h. These variables are then used in calling the cudaStreamCreate() function. Each call to cudaStreamCreate() creates a new stream and deposits a handle to the stream into its parameter. After the calls in lines 24 and 25, the host code can use either stream0 or stream1 in subsequent cudaMemcpyAsync() calls and kernel launches.
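A minimal sketch of stream creation and overlapped use (the kernel and buffer names are illustrative, not those of the book's figures):

__global__ void scale_kernel(float *p, int n) { // illustrative kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;
}

void overlap_sketch(float *d_a, float *d_b, float *h_pinned, int n) {
    cudaStream_t stream0, stream1;
    cudaStreamCreate(&stream0);
    cudaStreamCreate(&stream1);
    // The copy in stream0 and the kernel in stream1 may overlap;
    // operations within each stream still execute in order.
    cudaMemcpyAsync(d_a, h_pinned, n * sizeof(float), cudaMemcpyHostToDevice, stream0);
    scale_kernel<<<(n + 255) / 256, 256, 0, stream1>>>(d_b, n);
    cudaStreamSynchronize(stream0); // wait only for the copy, not for the kernel
    cudaStreamDestroy(stream0);
    cudaStreamDestroy(stream1);
}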
Figure 19.14 shows part 3 of the compute process code. Lines 27 and 28 calculate the process IDs of the left and right neighbors of the compute process. The left_neighbor and right_neighbor variables will be used by compute processes as parameters when they send messages to and receive messages from their neighbors. For process 0, there is no left neighbor, so line 27 assigns the MPI constant MPI_PROC_NULL to left_neighbor to record this fact. For process np-2, there is no right neighbor, so line 28 assigns MPI_PROC_NULL to right_neighbor. For all the internal processes, lines 27 and 28 assign pid-1 to left_neighbor and pid+1 to right_neighbor, respectively.
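The effect of lines 27 and 28 can be sketched as follows (the exact expressions in the book's figure may differ):

int left_neighbor  = (pid > 0)      ? (pid - 1) : MPI_PROC_NULL; // process 0 has no left neighbor
int right_neighbor = (pid < np - 2) ? (pid + 1) : MPI_PROC_NULL; // process np-2 has no right neighbor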
Figure 19.14 Compute process code (part 3).
Lines 31-33 set up several offsets that will be used to launch kernels and exchange data so that the computation and communication can be overlapped. These offsets define the regions of grid points that need to be calculated in each stage of Figure 19.12. They are also visualized in Figure 19.15. Note that the contents of each device memory consist of four slices of left halo points (dashed white), plus four slices of left boundary points, plus dimx*dimy*(dimz-8) internal points, plus four slices of right boundary points, plus four slices of right halo points (dashed white). The variable left_stage1_offset defines the starting point of the slices that are needed to calculate the left boundary slices. This includes 12 slices of data: 4 slices of left neighbor halo points, 4 slices of boundary points, and 4 slices of internal points. These slices are the leftmost in the partition, so the offset value is set to 0 in line 31. The variable right_stage2_offset defines the starting point of the slices that are needed to calculate the right boundary slices. This also includes 12 slices: 4 slices of internal points, 4 slices of right boundary points, and 4 slices of right halo cells. The starting point of these 12 slices can be derived by subtracting 12 from the total number of slices, dimz+8. Therefore, the starting offset for these 12 slices is dimx*dimy*(dimz-4).
Figure 19.15 Device memory offsets used for data exchange with neighbor processes.
Line 34 is an MPI barrier synchronization, similar in spirit to CUDA's __syncthreads(). An MPI barrier forces all MPI processes specified by the parameter to wait for each other. None of the processes can continue their execution beyond this point until every one of them has reached it. The reason why we want a barrier synchronization here is to make sure that all compute nodes have received their input data and are ready to perform the computation steps. Since they will be exchanging data with each other, we would like to have them all start at about the same time. This way, we will not be in a situation where a few tardy processes delay all other processes during the data exchange. MPI_Barrier() is a collective communication function. We will discuss collective communication API functions in more detail in the next section.
Line 35 starts a loop that performs the computation steps. For each iteration, each compute process will perform one cycle of the two-stage process in Figure 19.12.
Line 36 calls a function that will perform four computation steps to generate the four slices of the left boundary points in stage 1. We assume that there is a kernel that performs one computation step on a region of grid points. The launch_kernel() function takes several parameters. The first parameter is a pointer to the output data area for the kernel. The second parameter is a pointer to the input data area. In both cases, we add left_stage1_offset to the input and output addresses in the device memory. The next three parameters specify the dimensions of the portion of the grid to be processed, which is 12 slices in this case. Note that we need to have four slices on each side. Line 37 does the same for the right boundary points in stage 1. Note that these kernels will be launched within stream0 and will be executed sequentially.
Line 38 launches a kernel to generate the dimx*dimy*(dimz-8) internal points in stage 2. Note that this also requires four slices of input boundary values on each side, so the total amount of input is dimx*dimy*dimz grid points. The kernel is launched in stream1 and will be executed in parallel with those launched by lines 36 and 37.
Figure 19.16 shows part 4 of the compute process code. Line 39 copies the four slices of left boundary points to the host memory in preparation for data exchange with the left neighbor process. Line 40 copies the four slices of right boundary points to the host memory in preparation for data exchange with the right neighbor process. Both are asynchronous copies in stream0 and will wait for the two kernels in stream0 to complete before they copy data. Line 41 is a synchronization that forces the process to wait for all operations in stream0 to complete before it can continue. This makes sure that the left and right boundary points are in the host memory before the process proceeds with the data exchange.
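A sketch of this copy-then-synchronize sequence (the buffer names, source pointers, and sizes are illustrative):

// Queue both device-to-host boundary copies behind the stage-1 kernels in stream0,
// then block until stream0 drains so the slices are in host memory for MPI.
cudaMemcpyAsync(h_left_boundary, d_left_boundary, num_halo_bytes,
                cudaMemcpyDeviceToHost, stream0);
cudaMemcpyAsync(h_right_boundary, d_right_boundary, num_halo_bytes,
                cudaMemcpyDeviceToHost, stream0);
cudaStreamSynchronize(stream0);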
Figure 19.16 Compute process code (part 4).

During the data exchange phase, we will have all MPI processes send their boundary points to their left neighbors. That is, all processes will have their right neighbors sending data to them. It is, therefore, convenient to have an MPI function that sends data to a destination and receives data from a source. This reduces the number of MPI function calls. The MPI_Sendrecv() function in Figure 19.17 is such a function. It is essentially a combination of MPI_Send() and MPI_Recv(), so we will not further elaborate on the meaning of the parameters.

Figure 19.17 Syntax for the MPI_Sendrecv() function.
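Since the figure itself is not reproduced here, the standard prototype from the MPI specification reads (the const qualifier on sendbuf was added in MPI-3):

int MPI_Sendrecv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                 int dest, int sendtag,
                 void *recvbuf, int recvcount, MPI_Datatype recvtype,
                 int source, int recvtag,
                 MPI_Comm comm, MPI_Status *status);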
Figure 19.18 shows part 5 of the compute process code. Line 42 sends four slices of left boundary points to the left neighbor and receives four slices of right halo points from the right neighbor. Line 43 sends four slices of right boundary points to the right neighbor and receives four slices of left halo points from the left neighbor. In the case of process 0, its left_neighbor has been set to MPI_PROC_NULL in line 27, so the MPI runtime will not send out the message in line 42 or receive the message in line 43 for process 0. Likewise, the MPI runtime will not receive the message in line 42 or send out the message in line 43 for process np-2. Therefore, the conditional assignments in lines 27 and 28 eliminate the need for special if-then-else statements in lines 42 and 43.
Figure 19.18 Compute process code (part 5).

After the MPI messages have been sent and received, lines 44 and 45 transfer the newly received halo points to the d_output buffer in the device memory. These copies are done in stream0, so they will execute in parallel with the kernel launched in line 38.

Line 46 is a synchronization operation for all device activities. This call forces the process to wait for all device activities, including kernels and data copies, to complete. When the cudaDeviceSynchronize() function returns, all d_output data for the current computation step are in place: left halo data from the left neighbor process, left boundary data from the kernel launched in line 36, internal data from the kernel launched in line 38, right boundary data from the kernel launched in line 37, and right halo data from the right neighbor.
Lines 47 and 48 swap the d_input and d_output pointers. This turns the d_output data of the current computation step into the d_input data of the next computation step. The execution then proceeds to the next computation step by going to the next iteration of the loop that begins at line 35. This continues until all compute processes complete the number of computation steps specified by the parameter nreps.
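The swap is just an exchange of pointer values, along the lines of:

float *temp = d_input; // after the swap, the old output becomes the next step's input
d_input = d_output;
d_output = temp;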
Figure 19.19 shows part 6, the final part, of the compute process code. Line 49 is a barrier synchronization that forces all processes to wait for each other to finish their computation steps. Lines 50-52 swap d_output with d_input. This is because lines 47 and 48 swapped d_output with d_input in preparation for the next computation step; this is unnecessary after the last computation step, so we use lines 50-52 to undo the swap. Line 53 copies the final output to the host memory. Line 54 sends the output to the data server. Line 55 waits for all processes to complete. Lines 56-59 free all the resources before returning to the main program.
Figure 19.19 Compute process code (part 6).
Figure 19.20 shows part 3, the final part, of the data server process code, which continues from Figure 19.10. Line 20 waits for all compute nodes to complete their computation steps and send their outputs. This barrier corresponds to the barrier at line 55 of the compute process. Line 22 receives the output data from all the compute processes. Line 23 stores the output to external storage. Lines 24 and 25 free resources before returning to the main program.

Figure 19.20 Data server process code (part 3).
The second type of MPI communication is collective communication, which involves a group of MPI processes. We have already seen an example of this type of MPI communication API: MPI_Barrier. Other commonly used collective communication types are broadcast, reduction, gather, and scatter.

Barrier synchronization with MPI_Barrier() is perhaps the most commonly used collective communication function. As we have seen in the stencil example, barriers are used to ensure that all MPI processes are ready before they begin to interact with each other. We will not elaborate on the other types of MPI collective communication functions, but we encourage readers to read up on the details of these functions. In general, collective communication functions are highly optimized by MPI runtime developers and system vendors. Using them usually leads to better performance as well as better readability and productivity.
We covered basic patterns of joint CUDA/MPI programming in this chapter. All processes in an MPI application run the same program. However, each process can follow different control flow and function call paths to specialize their roles, as is the case of the data server and the compute processes in our example in this chapter. We also presented a common pattern where compute processes exchange data. We presented the use of CUDA streams and asynchronous data transfers to enable the overlap of computation and communication. We would like to point out that while MPI is a very different programming system, all major MPI concepts that we covered in this chapter—SPMD, MPI ranks, and barriers—have counterparts in the CUDA programming model. This confirms our belief that by teaching parallel programming with one model well, our students can quickly pick up other programming models easily. We would like to encourage readers to build on the foundation from this chapter and study more advanced MPI features and other important patterns.

19.1. For vector addition, if there are 100,000 elements in each vector and we are using three compute processes, how many elements are we sending to the last compute process?

19.2. If the MPI call MPI_Send(ptr_a, 1000, MPI_FLOAT, 2000, 4, MPI_COMM_WORLD) resulted in a data transfer of 40,000 bytes, what is the size of each data element being sent?
19.3. Which of the following statements is true?
a. MPI_Send() is blocking by default.
b. MPI_Recv() is blocking by default.
c. MPI messages must be at least 128 bytes.
d. MPI processes can access the same variable through shared memory.
19.4. Use the code base in Appendix A and examples in Chapters 3, 4, 5, and 6 to develop an OpenCL version of the matrix–matrix multiplication application.

1. Gropp W, Lusk E, Skjellum A. Using MPI: Portable Parallel Programming with the Message Passing Interface. 2nd ed. Cambridge, MA: MIT Press, Scientific and Engineering Computation Series; 1999.
20.1 Background
20.2 Dynamic Parallelism Overview
20.3 Important Details
20.4 Memory Visibility
20.5 A Simple Example
20.6 Runtime Limitations
20.7 A More Complex Example
20.8 Summary
CUDA dynamic parallelism is an extension to the CUDA programming model that enables a CUDA kernel to create new thread grids by launching new kernels. Dynamic parallelism was introduced with the Kepler architecture, first appearing in the GK110 chip. In previous CUDA systems, kernels could only be launched from the host code. Algorithms that involve recursion, irregular loop structures, time-space variation, or other constructs that do not fit a flat, single level of parallelism needed to be implemented with multiple kernel launches, which increases the burden on the host and the amount of host-device communication. Dynamic parallelism support allows algorithms that dynamically discover new work to prepare and launch kernels without burdening the host. This chapter describes the extended capabilities of the CUDA architecture that enable dynamic parallelism, including the modifications and additions to the CUDA programming model necessary to take advantage of these capabilities, as well as guidelines and best practices for exploiting this added capability.

Many real-world applications employ algorithms that dynamically vary the amount of work performed. For example, Figure 20.1 shows a turbulence simulation example where the level of required modeling detail varies across space and time. As the combustion flow moves from left to right, the level of activity and intensity increases. The level of detail required to model the right side of the model is much higher than that for the left side. On one hand, using a fixed fine grid would incur too much work for no gain on the left side of the model. On the other hand, using a fixed coarse grid would sacrifice too much accuracy on the right side of the model. Ideally, one should use fine grids for the parts of the model that require more detail and coarse grids for those that do not.
Figure 20.1 Fixed versus dynamic grids for a turbulence simulation model.
Previous CUDA systems require all kernels to be launched from the host code. The amount of work done by a thread grid is predetermined during kernel launch. With the SPMD programming style for kernel code, it is tedious, if not extremely difficult, to have thread blocks use different grid spacing. This limitation favors the use of fixed-grid systems, as we discussed in Chapter 12. To achieve the desired accuracy, such a fixed-grid approach, as illustrated in Figure 20.1, typically needs to accommodate the most demanding parts of the model and performs unnecessary extra work in parts that do not require as much detail.

A more desirable approach is shown as the dynamic grid in the lower right portion of Figure 20.1. As the simulation algorithm detects fast-changing simulation quantities in some areas of the model, it refines the grid in those areas to achieve the desired level of accuracy. Such refinement does not need to be done for the areas that do not exhibit such intensive activity. This way, the algorithm can dynamically direct more computation work to the areas of the model that benefit from the additional work.

Figure 20.2 shows a conceptual comparison between the original CUDA and the dynamic parallelism version with respect to the simulation model in Figure 20.1. Without dynamic parallelism, the host code must launch all kernels. If new work is discovered, such as refining the grid of an area of the model during the execution of a kernel, the kernel needs to report back to the host code and have the host code launch a new kernel. This is illustrated in Figure 20.2(a), where the host launches a wave of kernels, receives information from these kernels, and launches the next level of kernels for any new work discovered by the completed kernels.
Figure 20.2 Kernel launch patterns for algorithms with dynamic work variation: (a) without dynamic parallelism and (b) with dynamic parallelism.
Figure 20.2(b) shows that with dynamic parallelism, the threads that discover new work can simply go ahead and launch kernels to do that work. In our example, when a thread discovers that an area of the model needs to be refined, it can launch a kernel to perform the computation step on the refined grid area without the overhead of terminating the kernel, reporting back to the host, and having the host launch new kernels.

From the programmer's perspective, dynamic parallelism means that he or she can write a kernel launch statement in a kernel. In Figure 20.3, the main function (host code) launches three kernels, A, B, and C. These are kernel launches in the original CUDA model. What is different is that one of the kernels, B, launches three kernels X, Y, and Z. This would have been illegal in previous CUDA systems.
Figure 20.3 A simple example of a kernel (B) launching three kernels (X, Y, and Z).

The syntax for launching a kernel from a kernel is the same as that for launching a kernel from host code:

kernel_name<<< Dg, Db, Ns, S >>>([kernel arguments])
• Dg is of type dim3 and specifies the dimensions and size of the grid.
• Db is of type dim3 and specifies the dimensions and size of each thread block.
• Ns is of type size_t and specifies the number of bytes of shared memory that are dynamically allocated per thread block for this call, which is in addition to the statically allocated shared memory. Ns is an optional argument that defaults to 0.
• S is of type cudaStream_t and specifies the stream associated with this call. The stream must have been allocated in the same thread block where the call is being made. S is an optional argument that defaults to 0.
Although the syntax for launching a kernel from a kernel is similar to that for launching a kernel from the host code, there are several important differences that must be clearly understood by programmers.

All device configuration settings (e.g., shared memory and L1 cache size as returned from cudaDeviceGetCacheConfig(), and device limits as returned from cudaDeviceGetLimit()) will be inherited from the parent. That is, if the parent is configured for 16 K bytes of shared memory and 48 K bytes of L1 cache, then the child's execution settings will be configured identically. Likewise, a parent's device limits such as stack size will be passed as-is to its children.

Like CUDA API function calls in host code, any CUDA API function called within a kernel may return an error code. The last error code returned is recorded and may be retrieved via the cudaGetLastError() call. Errors are recorded on a per-thread basis, so that each thread can identify the most recent error that it has generated. The error code is of type cudaError_t, which is a 32-bit integer value.

Only the interstream synchronization capabilities of CUDA events are supported in kernel functions. Events within individual streams are currently not supported in kernel functions. This means cudaStreamWaitEvent() is supported, but cudaEventSynchronize(), timing with cudaEventElapsedTime(), and event query via cudaEventQuery() are not. These may be supported in a future version.

To ensure that this restriction is clearly seen by the user, dynamic parallelism cudaEvents must be created via cudaEventCreateWithFlags(), which currently only accepts the cudaEventDisableTiming flag value when called from a kernel.
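A sketch of device-side event usage under these rules, assuming stream0 and stream1 were created in the same thread block:

// Inside a kernel:
cudaEvent_t ev;
cudaEventCreateWithFlags(&ev, cudaEventDisableTiming); // the only flag accepted in device code
cudaEventRecord(ev, stream0);        // mark the current end of stream0
cudaStreamWaitEvent(stream1, ev, 0); // stream1 will not pass this point until ev has fired
cudaEventDestroy(ev);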
Event objects may be shared between the threads within the CUDA thread block that created them, but are local to that block and should not be passed to child/parent kernels. Event handles are not guaranteed unique between blocks, so using an event handle within a block that did not allocate it will result in undefined behavior.

Both named and unnamed (NULL) streams are available under dynamic parallelism. Named streams may be used by any thread within a thread block, but stream handles should not be passed to other blocks or child/parent kernels. In other words, a stream should be treated as private to the block in which it is created. Stream handles are not guaranteed to be unique between blocks, so using a stream handle within a block that did not allocate it will result in undefined behavior.

Similar to host-side launch, work launched into separate streams may run concurrently, but actual concurrency is not guaranteed. Programs that require concurrency between child kernels are ill-formed and will have undefined behavior.
The host-side NULL stream's global synchronization semantic is not supported under dynamic parallelism. To make this behavior change explicit, all streams must be created in a kernel using the cudaStreamCreateWithFlags() API with the cudaStreamNonBlocking flag. Calls to cudaStreamCreate() will fail with a compiler "unrecognized function call" error, so as to make clear the different stream semantic under dynamic parallelism.
The cudaStreamSynchronize() API is not available within a kernel; only cudaDeviceSynchronize() can be used to wait explicitly for launched work to complete. This is because the underlying system software implements only a block-wide synchronization call, and it is undesirable to offer an API with incomplete semantics (i.e., the synchronize guarantees one stream synchronizes, but coincidentally provides a full barrier as a side effect).

A thread that is part of an executing grid and configures and launches a new grid belongs to the parent grid, and the grid created by the launch is the child grid. As shown in Figure 20.4, the creation and completion of child grids is properly nested, meaning that the parent grid is not considered complete until all child grids created by its threads have completed. Even if the parent threads do not explicitly synchronize on the child grids launched, the runtime guarantees an implicit synchronization between the parent and child by forcing the parent to wait for all its children to exit execution before it can exit execution.

Figure 20.4 Completion sequence for parent and child grids.
A thread in the parent grid may only perform synchronization on the grids launched by that thread (e.g., using cudaDeviceSynchronize()), on other threads in the thread block (e.g., using __syncthreads()), or on streams created within the same thread block (e.g., using cudaStreamWaitEvent()). Streams created by a thread within a grid exist only within the scope of that thread's thread block and have undefined behavior when used outside of the thread block where they were created. Streams created within a thread block are implicitly synchronized when all threads in the thread block exit execution. The behavior of operations on a stream that has been modified outside of the thread block scope is undefined. Streams created on the host have undefined behavior when used within any kernel, just as streams created by a parent grid have undefined behavior if used within a child grid.

Parent and child grids have coherent access to global memory, with weak consistency guarantees between child and parent. There are two points in the execution of a child grid when its view of memory is fully consistent with the parent thread: (1) when the child grid is created by the parent thread, and (2) when the child grid completes, as signaled by a synchronization API call in the parent thread.
All global memory operations in the parent thread prior to the child grid's invocation are visible to the child grid. All memory operations of the child grid are visible to the parent after the parent has synchronized on the child grid's completion.

Zero-copy system memory has identical coherence and consistency guarantees as global memory, and follows the semantics just detailed. A kernel may not allocate or free zero-copy memory, however; it may only use pointers passed in from the host code.

Constants are immutable and may not be written to by a kernel, even between dynamic parallelism kernel launches. That is, the value of all __constant__ variables must be set from the host prior to the launch of the first kernel. Constant memory variables are globally visible to all kernels, and so must remain constant for the lifetime of the dynamic parallelism launch tree invoked by the host code.

Taking the address of a constant memory object from within a thread has the same semantics as for non-dynamic-parallelism programs, and passing that pointer from parent to child or from child to parent is fully supported.
Local memory is private storage for a thread, and is not visible outside of that thread. It is illegal to pass a pointer to local memory as a launch argument when launching a child kernel. The result of dereferencing such a local memory pointer from a child will be undefined. For example, the following is illegal, with undefined behavior if x_array is accessed by child_launch:

int x_array[10]; // Creates x_array in parent's local memory
child_launch<<< 1, 1 >>>(x_array);

It is sometimes difficult for a programmer to be aware of when a variable is placed into local memory by the compiler. As a general rule, all storage passed to a child kernel should be allocated explicitly from the global-memory heap, either with malloc() or new() or by declaring __device__ storage at the global scope. For example, Figure 20.5(a) shows a valid kernel launch where a pointer to a global memory variable is passed as an argument into the child kernel. Figure 20.5(b) shows an invalid code where a pointer to a local memory (register) variable is passed into the child kernel.

Figure 20.5 Passing a pointer as an argument to a child kernel: (a) valid (value is global storage) and (b) invalid (value is local storage).
The NVIDIA compiler will issue a warning if it detects that a pointer to local memory is being passed as an argument to a kernel launch. However, such detection is not guaranteed.

Shared memory is private storage for an executing thread block, and data is not visible outside of that thread block. Passing a pointer to shared memory to a child kernel, either through memory or as an argument, will result in undefined behavior.

Texture memory accesses (read only) are performed on a memory region that may alias the writable global memory region. Coherence for texture memory is enforced at the invocation of a child grid and when a child grid completes. This means that writes to memory prior to a child kernel launch are reflected in the texture memory accesses of the child. Also, writes to memory by a child will be reflected in the texture memory accesses by the parent, after the parent has synchronized on the child's completion.

Concurrent texture memory accesses and writes to global memory objects that alias the texture memory objects, whether between a parent and its children or between multiple children, will result in undefined behavior.
In this section, we provide a simple example of coding in each of two styles—first in the original CUDA style, and second in the dynamic parallelism style. The example problem is extracted from the divergent phase of a hypothetical parallel algorithm. It does not compute useful results but provides a conceptually simple calculation that can be easily verified. It serves to illustrate the difference between the two styles and how one can use the dynamic parallelism style to reduce control flow divergence when the amount of work done by each thread in an algorithm can vary dynamically.

Line 22 of Figure 20.6 shows the host code main function for the example coded without dynamic parallelism. It allocates the foo variable on the device (line 25) and initializes it to 0 (line 26). It then launches the diverge_cta() kernel to perform a calculation on foo (line 27). The kernel is launched with a grid of K (set to 2 in line 5) blocks of 32×M (M set to 32 in line 4) threads each. Therefore, in this example, we are launching two blocks of 1,024 threads each.

Figure 20.6 A simple example of the divergent phase of a hypothetical parallel algorithm coded in CUDA without dynamic parallelism.
In the diverge_cta() kernel, threads whose threadIdx.x values are not a multiple of 32 return immediately. In our example, only the threads with threadIdx.x values of 0, 32, 64, …, 960, 992 continue to execute. In line 16, all M remaining threads of each block call the entry() function, which increments the foo variable N (set to 128 in line 3) times. This is done by the for loop in line 8. The atomic operation in line 9 is necessary because multiple blocks call the entry() function at the same time. The atomic operation ensures that increments by one of the blocks are not trampled by those of other blocks. In our case, the atomic operation ensures that all increments by both thread blocks are properly reflected in the variable foo.

After all blocks have completed their increments, the value of foo should be K*M*N, since there are K blocks and each block has M active threads, each incrementing the foo variable N times. In line 17, thread 0 of each block initializes a shared memory variable x (declared in line 13) to the value 5, which is visible to all threads in the same block. Thread 0 then terminates. After barrier synchronization (line 20), all remaining M-1 threads in each block perform an atomic operation on variable foo (line 21). The increment amount of the atomic operation is the value of x (5). Since there are only M-1 threads executing (all of whose threadIdx.x values are multiples of 32), all threads in a block jointly add 5*(M-1) to the value of foo. With a total of K blocks in the grid, the total contribution of line 21 among all blocks is K*(5*(M-1)).

After the kernel terminates (line 29), the host copies the value of foo into its variable h_foo (line 30). The host then performs a test and checks whether the total value in h_foo is the expected value of K*N*M + K*(5*(M-1)), which is K*(N*M + 5*(M-1)) (line 31).

Figure 20.7 shows a version of the source code based on dynamic parallelism. The main function is identical to that of Figure 20.6 and is not shown. Also, we assign line numbers only to the lines that differ from Figure 20.6. In this version, instead of having the M divergent threads of each block call entry() as a device function, we have thread 0 of each block launch entry() as a kernel. In line 2, the device function entry() of Figure 20.6 is now declared as a kernel function.
Figure 20.7 The diverge_cta() kernel revised using dynamic parallelism.
In line 3, the diverge_cta() kernel launches the entry() kernel with a single block that contains M threads. A total of K (set to 2) such kernel launches are performed: in our example, one by thread 0 of block 0 and one by thread 0 of block 1. That is, instead of having each of the remaining M threads of a block call entry() as a device function, we use thread 0 of each block to launch entry() as a kernel with M threads.
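A minimal sketch of this revised launch pattern, following the walkthrough's constants and assuming foo is a __device__ counter (the exact code of Figure 20.7 may differ):

#define N 128
#define M 32
__device__ int foo;

__global__ void entry(void) {
    // All M child threads take the same path: no control divergence here.
    for (int i = 0; i < N; i++)
        atomicAdd(&foo, 1);
}

__global__ void diverge_cta(void) {
    // One child grid per block, launched by thread 0 only.
    if (threadIdx.x == 0)
        entry<<<1, M>>>();
}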
Note that the effect on the foo value remains the same. The entry() kernel is launched K times. For each launch, there are M threads executing the entry() kernel, and each thread increments the foo value N times. Therefore, the total change due to all threads is K*M*N. However, the amount of divergence changes. The original kernel still has divergence, but the increments are now done by the entry() kernel, in which all neighboring threads take the same control flow path. The amount of time the code spends in control-divergent execution decreases.

Memory is allocated as the backing store for the parent kernel state to be used when synchronizing on a child launch. Conservatively, this memory must be able to store the state of the maximum number of live threads possible on the GPU. This in turn means that each level of nesting requires ~150 MB of device memory on a current generation device, which will be unavailable for program use even if it is not all consumed. The dynamic parallelism runtime system detects if the parent exits without calling cudaDeviceSynchronize(). In this case, the runtime does not save the parent's state, and the memory footprint required for the program will be much less than the conservative maximum.
除了线程后备存储之外,系统软件还使用更多内存,例如存储启动队列和事件。动态并行的总内存占用量很难准确指定,但可以在运行时查询。
In addition to the thread backing-store, more memory is used by the system software, for example, to store launch queues and events. The total memory footprint of dynamic parallelism is difficult to specify exactly, but may be queried at runtime.
在动态并行性下,一个内核可以启动另一个内核,并且该内核可以启动另一个内核,依此类推。每一次下级发射都被视为一个新的“嵌套层数”,总层数就是该程序的“嵌套深度”。
Under dynamic parallelism, one kernel may launch another kernel, and that kernel may launch another, and so on. Each subordinate launch is considered a new “nesting level,” and the total number of levels is the “nesting depth” of the program.
The maximum nesting depth is limited in hardware to 64, but in software it may be limited to 63 or fewer. Practically speaking, the real limit will be the amount of memory required by the system for each new level (see the preceding "Memory Footprint" section). The number of levels to be supported must be configured before the top-level kernel is launched from the host, to guarantee successful execution of a nested program.
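One plausible way to configure this from host code, using the CUDA device runtime limit interface (the depth value here is purely illustrative):

// Reserve backing store for parent grids that synchronize on children up to
// two levels deep; must be called before the top-level kernel launch.
cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 2);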
Currently, cudaMalloc and cudaFree have slightly modified semantics between the host and device environments (Table 20.1). Within the device environment the total allocatable memory is limited to the device malloc() heap size, which may be smaller than the available unused device memory. Also, it is an error to invoke cudaFree from the host program on a pointer that was allocated by cudaMalloc on the device, or to invoke cudaFree from the device program on a pointer that was allocated by cudaMalloc on the host. These limitations may be removed in a future version.
Table 20.1 Memory allocation and deallocation from host and device.
| | cudaMalloc() on host | cudaMalloc() on device |
| cudaFree() on host | Supported | Not supported |
| cudaFree() on device | Not supported | Supported |
| Allocation limit | Free device memory | cudaLimitMallocHeapSize |
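For instance, the heap that bounds device-side allocations (the cudaLimitMallocHeapSize row above) can be enlarged and queried from the host before any kernel runs; the 128 MB figure below is purely illustrative.

// Host code: enlarge, then query, the device-side allocation heap.
size_t heapBytes = 0;
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128 * 1024 * 1024);
cudaDeviceGetLimit(&heapBytes, cudaLimitMallocHeapSize);
printf("Device malloc heap: %zu bytes\n", heapBytes);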
No notification of ECC errors is available to code within a CUDA kernel; ECC errors are reported only on the host side. Any ECC error that arises during execution of a dynamic parallelism kernel will either generate an exception or let execution continue (depending on the error and configuration).
Unlimited named streams are supported per block, but the maximum concurrency supported by the platform is limited. If more streams are created than can support concurrent execution, some of these may serialize or alias with each other. In addition to block-scope named streams, each thread has an unnamed (NULL) stream, but named streams will not synchronize against it (indeed, all named streams must be created with a flag explicitly preventing this).
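As a sketch of how a device-side named stream might be created and used (child is a placeholder kernel; the non-blocking flag is the one the text refers to):

__global__ void child(void);  // assumed to be defined elsewhere

__global__ void parent(void) {
    // Block-scope named stream; the flag is mandatory in device code and
    // prevents implicit synchronization with the per-thread NULL stream.
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    child<<<1, 64, 0, s>>>();
    cudaStreamDestroy(s);
}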
Unlimited events are supported per block, but these consume device memory. Owing to resource limitations, if too many events are created (the exact number is implementation-dependent), GPU-launched grids may attain less concurrency than expected. Correct execution is guaranteed, however.
When a kernel is launched, all associated data are added to a slot within the launch pool, which is tracked until the kernel completes. Launch pool storage may be virtualized by the system between device and host memory; however, device-side launch pool storage offers better performance. The amount of device memory reserved for device-side launch pool storage is configurable before the initial kernel launch from the host.
We now show a more interesting and useful example: recursive, adaptive subdivision of spline curves. It illustrates a variable number of child kernel launches, according to the workload. The example calculates Bezier curves [Wiki_Bezier 2012], which are frequently used in computer graphics to draw smooth, intuitive curves defined by a set of control points, typically specified by a user.
Mathematically, a Bezier curve is defined by a set of control points P0 through Pn, where n is called its order (n=1 for linear, 2 for quadratic, 3 for cubic, etc.). The first and last control points are always the end points of the curve; however, the intermediate control points (if any) generally do not lie on the curve.
Given two control points P0 and P1, a linear Bezier curve is simply a straight line connecting the two points. The coordinates of the points on the curve are given by the following linear interpolation formula:
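B(t) = P0 + t(P1 − P0) = (1 − t)P0 + tP1,  0 ≤ t ≤ 1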
A quadratic Bezier curve is defined by three control points P0, P1, and P2. The points on a quadratic curve are defined as a linear interpolation of corresponding points on the linear Bezier curves from P0 to P1 and from P1 to P2, respectively. The calculation of the coordinates of points on the curve is expressed by the following formula:
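B(t) = (1 − t)[(1 − t)P0 + tP1] + t[(1 − t)P1 + tP2],  0 ≤ t ≤ 1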
which can be simplified into the following formula:
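B(t) = (1 − t)²P0 + 2(1 − t)tP1 + t²P2,  0 ≤ t ≤ 1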
Figure 20.8 shows a CUDA C program that calculates the coordinates of points on a Bezier curve. The first part of the code defines several operators (operator+, operator-, operator*, length) for 2D coordinates that will be used in the kernel code. They should be quite self-explanatory, so we will not elaborate on them.
Figure 20.8 Bezier curve calculation without dynamic parallelism.
The main function (line 20) initializes a set of control points to random values (lines 22, 23, and 24). In a real application, these control points are most likely inputs from a user. The control points are part of the bLines_h array, whose element type BezierLine is declared in line 1. The storage for the bLines_h array is allocated in line 21. The host code then allocates the corresponding device memory for the bLines_d array and copies the initialized data to bLines_d (lines 26-28). It then calls the computeBezierLines() kernel to calculate the coordinates of the Bezier curve.
The computeBezierLines() kernel is designed to use a thread block to calculate the curve points for a set of three control points (of the quadratic Bezier formula). Each thread block first computes a measure of the curvature of the curve defined by its three control points. Intuitively, the larger the curvature, the more points it takes to draw a smooth quadratic Bezier curve for the three control points. This defines the amount of work to be done by each thread block, which is reflected in lines 3 and 4, where the total number of points to be calculated by the current thread block is proportional to the curvature value.
In the for loop in line 5, all threads calculate a consecutive set of Bezier curve points in each iteration. The detailed calculation in the loop body is based on the formula we presented earlier. The key point is that the number of iterations taken by threads in one block can be very different from that taken by threads in another block. Depending on the scheduling policy, such variation in the amount of work done by each thread block can result in decreased utilization of streaming multiprocessors and thus reduced performance.
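Figure 20.8's full listing is not reproduced here; the following condensed sketch shows the structure the text describes. The curvature heuristic, the constants, and MAX_TESS_POINTS are illustrative stand-ins, not the figure's exact code.

#define MAX_TESS_POINTS 32  // illustrative fixed-size output buffer

struct BezierLine {
    float2 CP[3];                        // three control points
    float2 vertexPos[MAX_TESS_POINTS];   // computed curve points
    int nVertices;
};

__global__ void computeBezierLines(BezierLine* bLines, int nLines) {
    int bidx = blockIdx.x;               // one block per set of control points
    if (bidx < nLines) {
        float2 p0 = bLines[bidx].CP[0], p1 = bLines[bidx].CP[1],
               p2 = bLines[bidx].CP[2];

        // Lines 3-4: a simple curvature measure (distance of P1 from the
        // P0-P2 midpoint) sets how many points this block computes.
        float dx = p1.x - 0.5f * (p0.x + p2.x);
        float dy = p1.y - 0.5f * (p0.y + p2.y);
        float curvature = sqrtf(dx * dx + dy * dy);
        int nPoints = min((int)(curvature * 16.0f) + 4, MAX_TESS_POINTS);

        // Line 5: all threads cooperatively compute consecutive curve points.
        for (int inc = threadIdx.x; inc < nPoints; inc += blockDim.x) {
            float t = (float)inc / (float)(nPoints - 1);
            float omt = 1.0f - t;        // quadratic Bezier formula from above
            float x = omt * omt * p0.x + 2.0f * omt * t * p1.x + t * t * p2.x;
            float y = omt * omt * p0.y + 2.0f * omt * t * p1.y + t * t * p2.y;
            bLines[bidx].vertexPos[inc] = make_float2(x, y);
        }
        if (threadIdx.x == 0) bLines[bidx].nVertices = nPoints;
    }
}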
Figure 20.9 shows a Bezier curve calculation code using dynamic parallelism. It breaks the computeBezierLines() kernel of Figure 20.8 into two kernels. The first part, computeBezierLinesCDP(), discovers the amount of work to be done for each set of control points. The second part, computeBezierLinePositions(), performs the calculation.
Figure 20.9 Bezier calculation with dynamic parallelism.
With the new organization, the amount of work done for each set of control points by the computeBezierLinesCDP() kernel is much smaller than in the original computeBezierLines() kernel. Therefore, we use one thread to do this work in computeBezierLinesCDP(), as opposed to using one block in the original computeBezierLines(). In line 13, we only need to launch one thread per set of control points. This is reflected by dividing N_LINES by BLOCK_DIM to form the number of blocks in the kernel launch configuration.
There are two key differences between the computeBezierLinesCDP() kernel and the computeBezierLines() kernel. First, the index used to access the control points is formed on a thread basis (line 6) rather than a block basis, because the work for each set of control points is done by a thread rather than a block, as we mentioned before. Second, the memory for storing the calculated Bezier curve points is dynamically determined and allocated in line 8. This allows the code to assign just enough memory to each set of control points in the BezierLine type. Note that in Figure 20.8, each BezierLine element is declared with the maximum possible number of points, whereas the declaration in Figure 20.9 has only a pointer to dynamically allocated storage. Allowing a kernel to call the cudaMalloc() function can lead to a substantial reduction in memory usage when the curvature varies significantly across sets of control points.
Once a thread of the computeBezierLinesCDP() kernel determines the amount of work needed by its set of control points, it launches the computeBezierLinePositions() kernel to do the work (line 9). In our example, every thread of the parent grid creates a new grid for its assigned set of control points. This way, the work done by each thread block is balanced, while the amount of work done by each child grid varies.
After the computeBezierLinesCDP() kernel terminates, the main function can copy the data back and draw the curve on an output device. It can also call a kernel to free, in parallel, all the storage allocated for bLines_d (line 14). This can be faster than sequentially calling the cudaFree() function in a loop.
CUDA dynamic parallelism extends the CUDA programming model to allow kernels to launch kernels. This allows each thread to dynamically discover work and launch new grids according to the amount of work. It also supports dynamic allocation of device memory by threads. As we showed in the Bezier curve calculation example, these extensions can lead to better work balance across threads and blocks, as well as more efficient memory usage.
1. Bezier Curves, Available at: <http://en.wikipedia.org/wiki/B%C3%A9zier_curve>, 2012.
21.1 Goals Revisited
21.2 Memory Model Evolution
21.3 Kernel Execution Control Evolution
21.4 Core Performance
21.5 Programming Environment
21.6 Future Outlook
You made it! We have arrived at the finish line. In this final chapter, we will briefly review the goals that we have achieved through this book. Instead of drawing a conclusion, we will offer our vision for the future evolution of massively parallel processor architectures and how these advancements will impact parallel application development.
As we stated in Chapter 1, our primary goal is to teach you, the readers, how to program massively parallel processors. We promised that it would become easy once you develop the right insight and go about it the right way. In particular, we promised to focus on computational thinking skills that would enable you to think about problems in ways that are amenable to parallel computing.
We delivered on these promises through an introduction to performance considerations for CUDA (Chapter 6), three parallel patterns (Chapters 8, 9, and 10), two detailed application case studies (Chapters 11 and 12), and a chapter dedicated to computational thinking skills (Chapter 13). Through this process, we introduced the pertinent computer architecture knowledge needed to understand the hardware limitations that must be addressed in high-performance parallel programming. In particular, we focused on the memory bandwidth limitations that will remain the primary performance-limiting factor in massively parallel computing systems (Chapters 4, 5, 6, 8, 9, 10, 11, 12, and 13). We also introduced the concepts of floating-point precision/accuracy and numerical stability, and how they relate to parallel algorithms (Chapter 7). With these insights, high-performance parallel programming becomes a manageable process rather than a black art.
We stated that our second goal was to teach high-performance parallel programming styles that naturally avoid subtle correctness issues. To deliver on this promise, we showed that the simple data-parallel CUDA programming model (Chapters 3 and 4) based on barrier synchronization can be used to develop very high-performance applications. This disciplined way of parallel programming naturally avoids the subtle race conditions that plague many other parallel programming systems.
We promised to teach parallel programming styles that transparently scale across future hardware generations, which will be more and more parallel. With the CUDA threading model (Chapter 4), a massive number of thread blocks can be executed in any order relative to each other. Your application will be able to benefit from more parallel hardware coming in the future. We also presented algorithm techniques, such as tiling and cutoff, that allow your application to scale naturally to very large data sets (Chapters 8, 9, 10, 11, 12, and 13).
We promised to teach the programming skills in such a way that you will be able to apply them to other programming models and languages. To help you branch out to other programming models, we introduced OpenCL (Chapter 14), OpenACC (Chapter 15), Thrust (Chapter 16), CUDA FORTRAN (Chapter 17), C++ AMP (Chapter 18), and MPI-CUDA (Chapter 19). In each chapter, we explained how the programming model/language relates to CUDA and how you can apply the skills you learned based on CUDA to these models/languages.
We hope that you have enjoyed the book.
Now that we have reviewed our promises, we would like to share our view of the coming evolution of massively parallel processor architectures and how these advancements will likely impact application development. We hope that these outlooks will help you peek into the future of parallel programming. Our comments are based on the new features of GPUs based on NVIDIA's Kepler compute architecture, which arrived on the market as this book went to press.
Large virtual and physical address spaces. GPUs have traditionally used only a physical address space with up to 32 address bits, which limited the GPU DRAM to 4 gigabytes or less. This is because graphics applications have not demanded more than a few hundred megabytes of frame buffer and texture memory. This is in contrast to the 64-bit virtual space and 40+ bits of physical space that CPU programmers have been taking for granted for many years. However, more recent graphics applications have demanded more.
More recent GPU families such as Fermi and Kepler have adopted CPU-style virtual memory architecture with a 64-bit virtual address space and a physical address space of at least 40 bits. The obvious benefit is that Fermi and Kepler GPUs can incorporate more than 4 gigabytes of DRAM and that CUDA kernels can now operate on very large data sets, whether hosted entirely in on-board GPU DRAM, or by accessing mapped host memory.
The Fermi virtual memory architecture also lays the foundation for a potentially profound enhancement to the programming model. The CPU system physical memory and the GPU physical memory can now be mapped within a single, shared virtual address space [GNS 2009]. A shared global address space allows all variables in an application to have unique addresses. Such memory architecture, when exposed by programming tools and a runtime system to applications, can result in several major benefits.
First, new runtime systems can be designed to allow CPUs and GPUs to access the entire volume of application data under traditional protection models. Such a capability would allow applications to use a single pointer system to access application variables, removing a confusing aspect of the current CUDA programming model where developers must not dereference a pointer to the device memory in host functions.
These variables can reside in the CPU physical memory, the GPU physical memory, or even both. The runtime and hardware can implement data migration and coherence support like the GMAC system [GNS 2009]. If a CPU function dereferences a pointer and accesses a variable mapped to the GPU physical memory, the data access would still be serviced, but perhaps at a longer latency. Such capability would allow the CUDA programs to more easily call legacy libraries that have not been ported to GPUs. In the current CUDA memory architecture, the developer must manually transfer data from the device memory to the host memory to use legacy library functions to process them on the CPU. GMAC is built on a current CUDA runtime API and gives the developer the option to either rely on the runtime system to service such accesses or to manually transfer data as a performance optimization. However, the GMAC system currently does not have a clean mechanism for supporting multiple GPUs. The new virtual memory capability would enable a much more elegant implementation.
Ultimately, the virtual memory capability will also enable a mechanism similar to the zero-copy feature in CUDA 2.2 to allow the GPU to directly access very large physical CPU system memories. In some application areas such as CAD, the CPU physical memory system may have hundreds of gigabytes of capacity. These physical memory systems are needed because the applications require the entire data set to be “in core.” It is currently infeasible for such applications to take advantage of GPU computing. With the ability to directly access very large CPU physical memories, it becomes feasible for GPUs to accelerate these applications.
The second potential benefit is that the shared global address space enables peer-to-peer direct data transfer between devices in a multidevice system. This is supported in CUDA 4.0 and later, using the GPUDirect™ feature. In older CUDA systems, devices must first transfer data to the host memory before delivering them to a peer device. A shared global address space enables the implementation of a runtime system to provide an API to directly transfer data from one device memory to another device memory. Ultimately, a runtime system can be designed to automate such transfers when devices reference data in each other’s memory, but still allow the use of explicit data transfer APIs as a performance optimization. In CUDA 5.0, it is possible not only to reference data on other GPUs within a multi-GPU system, but also data on GPUs on other local systems.
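As a hedged sketch of the host-side calls involved (the buffer pointers and byte count are assumed to come from the application):

// Direct device 0 -> device 1 copy; falls back to staging through the host
// if peer access is not available.
void copyDev0ToDev1(float* dstOnDev1, const float* srcOnDev0, size_t nbytes) {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // second argument (flags) must be 0
    }
    cudaMemcpyPeer(dstOnDev1, 1, srcOnDev0, 0, nbytes);
}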
The third benefit is that one can implement I/O-related memory transfers directly in and out of the device memory. In older CUDA systems, I/O input data must first be transferred into the host memory before it can be copied into the device memory. The ability to directly transfer data in and out of the device memory can significantly reduce the copying cost and enhance the performance of applications that process large data sets.
Unified device memory space. In early CUDA memory models, the constant memory, shared memory, local memory, and global memory formed separate address spaces, and the developer could use pointers into the global memory but not the others. Starting with the Fermi architecture, these memories are parts of a unified address space. This makes it easier to abstract which memory contains a particular operand, allowing the programmer to deal with this only during allocation, and makes it simpler to pass CUDA data objects into other procedures and functions, irrespective of which memory area they come from. It makes CUDA code modules much more "composable." That is, a CUDA device function can now accept a pointer that may point to any of these memories. The code runs faster if a function argument pointer points to a shared memory location and slower if it points to a global memory location. The programmer can still perform manual data placement and transfers as a performance optimization. This capability significantly reduces the cost of building production-quality CUDA libraries.
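A small sketch of what this composability looks like; the kernel below is assumed to be launched with 256-thread blocks:

__device__ float firstFour(const float* p) {  // p may be shared OR global
    return p[0] + p[1] + p[2] + p[3];
}

__global__ void useBoth(float* gdata) {
    __shared__ float sdata[256];
    sdata[threadIdx.x] = gdata[threadIdx.x];
    __syncthreads();
    // Same function, two memory spaces: legal under the unified address space.
    if (threadIdx.x == 0)
        gdata[0] = firstFour(sdata) + firstFour(gdata);
}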
Configurable caching and scratchpad. The shared memory in early CUDA systems served as programmer-managed scratchpad memory and increased the speed of applications whose key data structures have localized, predictable access patterns. Starting with the Fermi architecture, the shared memory has been enhanced into a larger on-chip memory that can be configured to be partially cache and partially shared memory, which allows both predictable and less predictable access patterns to benefit from on-chip memory. This configurability allows programmers to apportion the resources according to the best fit for their application.
Applications in an early design stage that are ported directly from CPU code will benefit greatly from caching as the dominant part of the on-chip memory. This further smooths the performance tuning process by increasing the level of "easy performance" when a developer ports a CPU application to a GPU.
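On Fermi-class hardware this preference can be expressed per kernel from the host; myKernel is a placeholder name:

// Favor a larger L1 cache for freshly ported, cache-friendly code ...
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
// ... or favor a larger shared memory partition for tuned, tiled kernels.
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);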
Existing CUDA applications and those with predictable access patterns will be able to increase their use of fast shared memory by a factor of three while retaining the same device "occupancy" they had on previous-generation devices. For CUDA applications whose performance or capabilities are limited by the size of the shared memory, the threefold increase in size will be a welcome improvement. For example, in stencil computation, such as finite volume methods for computational fluid dynamics, the state loaded into the shared memory also includes "halo" elements from neighboring areas.
The relative portion of the halo decreases as the size of the stencil increases. In 3D simulation models, the halo cells can be comparable in data size to the main data at current shared memory sizes. This can significantly reduce the effectiveness of the shared memory, because a significant portion of the memory bandwidth is spent on loading the halo elements. For example, if the shared memory allows a thread block to load an 8³ (= 512) cell stencil into the shared memory, with one layer of halo elements on every surface, only 6³ (= 216) cells, or less than half of the loaded cells, are main data. The bandwidth spent on loading the halo elements is actually greater than that spent on the main data. A threefold increase in shared memory size allows some of these applications to use a more favorable stencil size where the halo accounts for a much smaller portion of the data in shared memory. In our example, the increased size would allow an 11³ (= 1,331) tile to be loaded by each thread block. With one layer of halo elements on each surface, a total of 9³ (= 729) cells, or more than half of the loaded elements, are main data. This significantly improves the memory bandwidth efficiency and the performance of the application.
Enhanced atomic operations. The atomic operations in Fermi are much faster than those in previous CUDA systems, and the atomic operations in Kepler are faster still. In addition, the Kepler atomic operations are more general. Atomic operations are frequently used in random scatter computation patterns such as histograms. Faster atomic operations reduce the need for algorithm transformations such as prefix sum (Chapter 9) [SHZ 2007] and sorting [SHG 2009] for implementing such random scattering computations; these transformations tend to increase the number of kernel invocations needed to perform the target computation. Faster atomic operations also reduce the need to involve the host CPU in algorithms that perform collective operations or in which multiple thread blocks update shared data structures, and thus reduce the data transfer pressure between the CPU and the GPU.
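For example, a histogram, the canonical random-scatter pattern, can be written directly with atomics rather than being recast as sort plus scan; a minimal sketch:

__global__ void histogram(const unsigned char* data, int n, unsigned int* bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1);  // one atomic increment per input element
}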
Enhanced global memory access. The speed of random memory access is much faster in Fermi and Kepler than in earlier CUDA systems. Programmers can be less concerned about memory coalescing. This allows more CPU algorithms to be used directly in the GPU as an acceptable base, further smoothing the porting path for applications that access a diversity of data structures, such as ray tracing, and for other applications that are heavily object-oriented and may be difficult to convert into perfectly tiled arrays.
Function calls within kernel functions. Previous CUDA versions did not allow function calls in kernel code. Although the source code of a kernel function can appear to have function calls, the compiler must be able to inline all function bodies into the kernel object so that there are no function calls in the kernel function at runtime. Although this model works reasonably well for the performance-critical portions of many applications, it does not support the software engineering practices of more sophisticated applications. In particular, it does not support system calls, dynamically linked library calls, recursive function calls, or virtual functions in object-oriented languages such as C++.
More recent device architectures such as Kepler support function calls in kernel functions at runtime. This feature is supported in CUDA 5.0 and later. The compiler is no longer required to inline the function bodies. It can still do so as a performance optimization. This capability is partly enabled by cached, fast implementation of massively parallel call frame stacks for CUDA threads. It makes CUDA device code much more “composable” by allowing different authors to write different CUDA kernel components and assemble them all together without heavy redesign costs. In particular, it allows modern object-oriented techniques such as virtual function calls, and software engineering practices such as dynamically linked libraries. It also allows software vendors to release device libraries without source code for intellectual property protection.
Support for function calls at runtime allows recursion and will significantly ease the burden on programmers as they transition from legacy CPU-oriented algorithms toward GPU-tuned approaches for divide-and-conquer types of computation. This also allows easier implementation of graph algorithms where data structure traversal often naturally involves recursion. In some cases, developers will be able to “cut and paste” CPU algorithms into a CUDA kernel and obtain a reasonably performing kernel, although continued performance tuning would still add benefit.
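A minimal illustration of runtime recursion in device code (a naive Fibonacci, purely illustrative):

__device__ int fib(int n) {
    // A true recursive call at runtime; deep recursion may require enlarging
    // the per-thread stack via cudaDeviceSetLimit(cudaLimitStackSize, ...).
    if (n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}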
Exception handling in kernel functions. Early CUDA systems did not support exception handling in kernel code. While this is not a significant limitation for the performance-critical portions of many high-performance applications, it often incurs software engineering costs in production-quality applications that rely on exceptions to detect and handle rare conditions without executing code that explicitly tests for such conditions. Also, it prevents kernel functions from using operating system services, whose use is typically avoided in performance-critical portions of applications anyway, except during debugging.
With the availability of exception handling and function call support, kernels can now call standard library functions such as printf() and malloc(), which can lead to system call traps. In our experience, the ability to call printf() in a kernel provides a subtle but important aid for debugging and supporting kernels in production software. Many end users are nontechnical and cannot easily be trained to run debuggers to provide developers with more details about what happened before a crash. The ability to execute printf() in the kernel allows developers to add a mode to the application that dumps the internal state, so that end users can submit meaningful bug reports.
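A sketch of the kind of state dump the text describes (the kernel name, condition, and message are illustrative):

__global__ void solverStep(const float* x, int n, int debugDump) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (debugDump && i == 0)  // one thread reports internal state for bug reports
        printf("solverStep: n=%d, grid=%d blocks, x[0]=%f\n",
               n, gridDim.x, x[0]);
}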
Simultaneous execution of multiple kernels. Previous CUDA systems allowed only one kernel to execute on each GPU device at any point in time. Multiple kernel functions could be submitted for execution, but they were buffered in a queue that releases the next kernel after the current one completes execution. Fermi and its successors allow multiple kernels from the same application to be executed simultaneously, which reduces the pressure on the application developer to "batch" multiple kernels into a larger kernel to more fully utilize a device. A typical example of the benefit is in parallel cluster applications that segment work into "local" and "remote" partitions, where remote work is involved in interactions with other nodes and resides on the critical path of global progress. In previous CUDA systems, kernels needed to be large to keep the device running efficiently, and one had to be careful not to launch local work that could block global work. This meant choosing between underutilizing the device while waiting for remote work to arrive, or eagerly starting on local work to keep the device productive at the cost of increased latency for completing remote work units. With multiple kernel execution, the application can use much smaller kernels for launching work; as a result, when high-priority remote work arrives, it can start running with low latency instead of being stuck behind a large kernel of local computation.
In Kepler and CUDA 5.0, the multiple kernel launch facility is extended by the addition of multiple hardware queues, which allow much more efficient scheduling of blocks from multiple kernels including kernels in multiple streams. In addition, the CUDA dynamic parallelism feature allows GPU work creation: GPU kernels can launch child kernels, asynchronously, dynamically, and in a data-dependent or compute load-dependent fashion. This reduces CPU–GPU interaction and synchronization, since the GPU can now manage more complex workloads independently. The CPU is in turn free to perform other useful computation.
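A sketch of the local/remote pattern using two streams (the kernel names and launch configurations are placeholders):

__global__ void localWork(float* d);    // assumed defined elsewhere
__global__ void remoteWork(float* d);   // assumed defined elsewhere

void launchBoth(float* dLocal, float* dRemote) {
    cudaStream_t sLocal, sRemote;
    cudaStreamCreate(&sLocal);
    cudaStreamCreate(&sRemote);
    // Small kernels in separate streams may now run concurrently, so
    // high-priority remote work is not stuck behind local computation.
    localWork<<<64, 256, 0, sLocal>>>(dLocal);
    remoteWork<<<8, 256, 0, sRemote>>>(dRemote);
    cudaStreamDestroy(sLocal);
    cudaStreamDestroy(sRemote);
}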
Interruptible kernels. Fermi allows a running kernel to be "canceled," which eases the creation of CUDA-accelerated applications that allow the user to abort a long-running calculation at any time, without requiring significant design effort on the part of the programmer. Once software support is available, this will enable implementation of user-level task scheduling systems that can better balance load between the GPU nodes of a computing system, and allows more graceful handling of cases where one GPU is heavily loaded and may be running slower than its peers [SH 2009].
Double-precision speed. Early devices performed double-precision floating-point arithmetic at a significant speed reduction (around eight times slower) compared to single precision. The floating-point arithmetic units of Fermi and its successors have been significantly strengthened to perform double-precision arithmetic at about half the speed of single precision. Applications that make intensive use of double-precision floating-point arithmetic benefit tremendously; other applications that use double precision carefully and sparingly see less performance impact.
In practice, the most significant benefit will likely be obtained by developers who are porting CPU-based numerical applications to GPUs. With the improved double-precision speed, they will have little incentive to spend the effort to evaluate whether their applications, or portions of them, can fit into single precision. This can significantly reduce the development cost of porting CPU applications to GPUs, and it addresses a major criticism of GPUs by the high-performance computing community. Some applications that operate on smaller input data types (8-bit, 16-bit, or single-precision floating point) may continue to benefit from using single-precision arithmetic, owing to the reduced bandwidth of using 32-bit versus 64-bit data. Applications such as medical imaging, remote sensing, radio astronomy, seismic analysis, and other natural data frequently fit into this category.
Better control flow efficiency. Fermi adopts a general compiler-driven predication technique [MHM1995] that can more effectively handle control flow than previous CUDA systems. While this technique was moderately successful in VLIW systems, it can provide more dramatic speed improvements in GPU warp-style SIMD execution systems. This capability can potentially broaden the range of applications that can take advantage of GPUs. In particular, major performance benefits can potentially be realized for applications that are very data-driven, such as ray tracing, quantum chemistry visualization [SSH2009], and cellular automata simulation.
Future CUDA compilers will include enhanced support for C++ templates and virtual function calls in kernel functions. Although the hardware enhancements, such as the ability to make function calls at runtime, are in place, enhanced C++ language support in the compiler has been taking more time. The C++ try/catch features will also likely be fully supported in kernel functions in the near future. With these enhancements, future CUDA compilers will support most mainstream C++ features. The remaining features in kernel functions such as new, delete, constructors, and destructors will likely be available in later compiler releases.
New and evolved programming interfaces will continue to improve the productivity of heterogeneous parallel programmers. As we showed in Chapter 15, OpenACC allows developers to annotate their sequential loops with compiler directives to enable a compiler to generate CUDA kernels. In Chapter 16, we showed that one can use the Thrust library of parallel type-generic functions, classes, and iterators to describe a computation and rely on the underlying mechanism to generate and configure the kernels that implement it. In Chapter 17, we presented CUDA FORTRAN, which allows FORTRAN programmers to develop CUDA kernels in their familiar language; in particular, this interface offers strong support for indexing into multidimensional arrays. In Chapter 18, we gave an overview of the C++ AMP interface, which allows developers to describe their kernels as parallel loops that operate on logical data structures, such as multidimensional arrays in a C++ application. We fully expect that new innovations will continue to arise to further boost the productivity of developers in this exciting area.
The new CUDA 5.0 SDK and the new GPUs based on the Kepler architecture mark the beginning of the fourth generation of GPU computing that places real emphasis on support for developer productivity and modern software engineering practices. With the new capabilities, the range of applications that will be able to get reasonable performance at minimal development cost will expand significantly. We expect that developers will immediately notice the reduction in application development, porting, and maintenance cost compared to previous CUDA systems. The existing applications developed with Thrust and similar high-level tools that automatically generate CUDA code will also likely get an immediate boost in their performance. While the benefit of hardware enhancements in memory architecture, kernel execution control, and compute core performance will be visible in the associated SDK release, the true potential of these enhancements may take years to be fully exploited in the SDKs and runtimes. For example, the true potential of the hardware virtual memory capability will likely be fully achieved only when a shared global address space runtime that supports direct GPU I/O and peer-to-peer data transfer for multi-GPU systems becomes widely available. We predict an exciting time for innovations from both industry and academia in programming tools and runtime environments for massively parallel computing in the next few years.
Enjoy the ride!
1. Gelado, I., Navarro, N., Stone, J., Patel, S., & Hwu, W. W. (2009). An asymmetric distributed shared memory model for heterogeneous parallel systems, Technical Report, IMPACT Group, University of Illinois, Urbana-Champaign.
2. Mahlke, S. A., Hank, R. E., McCormick, J. E., August, D. I., & Hwu, W. W. (June 1995). A comparison of full and partial predicated execution support for ILP processors, Proceedings of the 22nd Annual International Symposium on Computer Architecture, Santa Margherita Ligure, Italy, pp. 138-150.
3. Stone, J. E., & Hwu, W. W. (2009). WorkForce: A lightweight framework for managing multi-GPU computations, Technical Report, IMPACT Group, University of Illinois, Urbana-Champaign.
4. Satish, N., Harris, M., & Garland, M. (May 2009). Designing efficient sorting algorithms for manycore GPUs, Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium, Rome, Italy, pp. 177-187.
5. Sengupta, S., Harris, M., Zhang, Y., & Owens, J. D. (Aug. 2007). Scan Primitives for GPU computing, Proceedings of Graphics Hardware 2007, San Diego, California, pp. 97–106.
6. Stone, J. E., Saam, J., Hardy, D. J., Vandivort, K. L., Hwu, W. W., & Schulten, K. (March 8, 2009). High performance computation and interactive display of molecular orbitals on GPUs and multi-core CPUs, Proceedings of the Second GPGPU Workshop, ACM/IEEE Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 9-18.
This appendix shows host-only source code that can be used as the base of your CUDA matrix multiplication code. We have already inserted timer calls in key places so that you can use the measurements to isolate the execution time of the function that actually performs the matrix multiplication. It also has code that you can use to print out the matrix contents and verify the results.
/*****************************************************************************
  File Name   [matrixmul.cu]
  Synopsis    [This file defines the main function to do matrix-matrix
               multiplication.]
  Description []
*****************************************************************************/

//----------------------------------------------------------
// Included C libraries
//----------------------------------------------------------
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

//----------------------------------------------------------
// Included CUDA libraries
//----------------------------------------------------------
#include <cutil.h>

//----------------------------------------------------------
// Included helper functions
//----------------------------------------------------------
#include "assist.h"

//----------------------------------------------------------
// Included host matrix-matrix multiplication function prototype
//----------------------------------------------------------
#include "matrixmul.h"

/*==========================================================*/
/*                                                          */
/* Synopsis    [Main function]                              */
/* Description []                                           */
/*                                                          */
/*==========================================================*/
int
main(int argc, char** argv)
{
    bool if_quiet = false;
    unsigned int timer_compute = 0;
    int i, j;
    char *matrix_id = NULL, *input_fn = NULL, *gold_fn = NULL;
    int Mw = 0, Mh = 0, Nw = 0, Nh = 0, Pw = 0, Ph = 0;

    if (argc == 2) {
        matrix_id = strdup(argv[1]);
    } else {
        fprintf(stderr, "Error: Wrong input parameter numbers.\n");
        fprintf(stderr, "Usage:\n"
            "$> ./lab1.1-matrixmul <8, 128, 512, 3072, 4096>\n"
            "Examples:\n"
            "  $> ./lab1.1-matrixmul 128\n"
        );
        exit(1);
    }

    Mw = Mh = Nw = Nh = Pw = Ph = atoi(matrix_id);
    input_fn = (char *) malloc(30 * sizeof(char));
    gold_fn = (char *) malloc(30 * sizeof(char));
    sprintf(input_fn, "matrix_%s.bin", matrix_id);
    sprintf(gold_fn, "matrix_%s.gold", matrix_id);
    if_quiet = true; // Do not display matrix contents

    printf("Input matrix size: %d by %d\n", Mw, Mh);

    //----------------------------------------------------------
    // Setup host side
    //----------------------------------------------------------
    printf("Setup host side environment:\n");

    // allocate host memory for matrices M and N
    printf("  Allocate host memory for matrices M and N.\n");
    printf("    M: %d x %d\n", Mw, Mh);
    printf("    N: %d x %d\n", Nw, Nh);
    unsigned int size_M = Mw * Mh;
    unsigned int mem_size_M = sizeof(float) * size_M;
    float* hostM = (float*) malloc(mem_size_M);
    unsigned int size_N = Nw * Nh;
    unsigned int mem_size_N = sizeof(float) * size_N;
    float* hostN = (float*) malloc(mem_size_N);

    // allocate memory for the result on host side
    printf("  Allocate memory for the result on host side.\n");
    unsigned int size_P = Pw * Ph;
    unsigned int mem_size_P = sizeof(float) * size_P;
    float* hostP = (float*) malloc(mem_size_P);

    // Initialize the input matrices.
    printf("  Generate input matrix data for matrix M and N.\n");
    GenMatrixFile(input_fn, Pw, Ph, if_quiet);
    unsigned int * matrix = ReadMatrixFile(input_fn, Pw, Ph, true);
    for (i = 0; i < Mw; i++)
        for (j = 0; j < Nw; j++)
            hostM[i * Mw + j] = hostN[i * Mw + j] = (float) matrix[i * Mw + j];
    free(matrix); matrix = NULL;

    //==========================================================
    // Do matrix-matrix multiplication
    //==========================================================
    printf("  Computing matrix multiplication M x N:\n");
    if (Pw * Ph > 512 * 512) {
        printf("  (It takes time since the matrix is larger than 512 by 512.)\n");
    }
    CUT_SAFE_CALL(cutCreateTimer(&timer_compute));
    CUT_SAFE_CALL(cutStartTimer(timer_compute));
    float* reference = (float*) malloc(mem_size_P);
    computeGold(reference, hostM, hostN, Mh, Mw, Nw);
    CUT_SAFE_CALL(cutStopTimer(timer_compute));
    printf("  CPU Processing time : %f (ms)\n",
        cutGetTimerValue(timer_compute));
    CUT_SAFE_CALL(cutDeleteTimer(timer_compute));

    printf("  Matrix data checksum : %g\n", CheckSum(reference, Mw, Nw));
    if (!if_quiet) {
        printf("  Matrix data contents :\n");
        printf("  ");
    }
    matrix = (unsigned int *) malloc(Pw * Ph * sizeof(unsigned int));
    for (i = 0; i < Ph; i++) {
        for (j = 0; j < Pw; j++) {
            matrix[i * Pw + j] = (unsigned int) reference[i * Pw + j];
            if (!if_quiet) printf("%u ", matrix[i * Pw + j]);
        }
        if (!if_quiet) printf("\n  ");
    }
    if (!if_quiet) printf("\n");
    WriteMatrixFile(gold_fn, matrix, Pw, Ph, 1);
    free(matrix); matrix = NULL;
    free(reference);

    // clean up memory
    free(hostM); free(hostN); free(hostP);
    free(input_fn); free(gold_fn);
    return 0;
}
This “gold” version of the matrix multiplication function can be used to verify the results of your parallel implementation.
/*****************************************************************************
  File Name   [matrixmul_gold.cpp]
  Synopsis    [This file defines the gold-version matrix-matrix
               multiplication.]
  Description []
*****************************************************************************/
#include <stdio.h>
#include "matrixmul.h"
/*==========================================================================*/
/*  Synopsis    [Sequential/gold version of matrix-matrix multiplication.]  */
/*                                                                          */
/*  Description [This function computes the multiplication of two matrices  */
/*               M and N, and stores the output to P.]                      */
/*                                                                          */
/*==========================================================================*/
void
computeGold(
    float* P,        // Resultant matrix data
    const float* M,  // Matrix M
    const float* N,  // Matrix N
    int Mh,          // Matrix M height
    int Mw,          // Matrix M width
    int Nw)          // Matrix N width
{
    int i, j, k;
    float sum, a, b;
    for (i = 0; i < Mh; i++)
        for (j = 0; j < Nw; j++)
        {
            sum = 0;
            for (k = 0; k < Mw; k++)
            {
                a = M[i * Mw + k];
                b = N[k * Nw + j];
                //printf ("A[%d] * B[%d]\n", i * Mw + k, k * Nw + j);
                sum += a * b;
            }
            P[i * Nw + j] = (float)sum;
        }
}
This file contains the function prototype of the gold-version of matrix-matrix multiplication.
/*****************************************************************************
  File Name   [matrixmul.h]
  Synopsis    [This file defines the function prototype of the gold-version
               matrix-matrix multiplication.]
  Description []
*****************************************************************************/
#ifndef MATRIXMUL_H
#define MATRIXMUL_H
extern "C"
void computeGold(
    float* P, const float* M, const float* N, int Mh, int Mw, int Nw);
#endif
This file contains helper functions that assist in reading, writing, and verifying matrix data files, to simplify your implementation.
/*****************************************************************************
  File Name   [assist.h]
  Synopsis    [This file defines the helper functions to assist in file
               access and result verification in matrix-matrix
               multiplication.]
  Description []
*****************************************************************************/
FILE *
OpenFile (
    const char * const fn_p,
    const char * const open_mode_p,
    const int if_silent  // If not show messages
    )
{
    FILE * f_p = NULL;
    if (fn_p == NULL) {
        printf ("Null file name pointer.");
        exit (-1);
    }
    if (!if_silent) {
        fprintf(stdout, "Opening the file %s … ", fn_p);
    }
    f_p = fopen(fn_p, open_mode_p);
    if (f_p == NULL) {
        if (!if_silent) {
            fprintf(stdout, "failed.\n");
        } else {
            fprintf(stdout, "\nOpening the file %s … failed.\n\n", fn_p);
        }
        exit (-1);
    }
    if (!if_silent) fprintf(stdout, "succeeded.\n");
    return f_p;  // return the opened file handle
}
int
GenMatrixFile (
    const char * const matrix_fn_p,
    const unsigned int M_WIDTH,   // matrix width
    const unsigned int M_HEIGHT,  // matrix height
    const int if_silent           // If not show messages
    )
{
    FILE * matrix_fp = NULL;
    const unsigned int M_SIZE = M_WIDTH * M_HEIGHT;
    unsigned int * matrix = NULL;
    unsigned int i = 0, j = 0;
    matrix_fp = OpenFile (matrix_fn_p, "wb", 1);
    matrix = (unsigned int *) malloc (M_SIZE * sizeof (unsigned int));
    //if (!if_silent) fprintf (stdout, "Generated contents of matrix:\n");
    if (!if_silent) fprintf (stdout, " ");
    for (i = 0; i < M_HEIGHT; i++) {
        for (j = 0; j < M_WIDTH; j++) {
            matrix[i*M_WIDTH + j] = i+j+1;
            if (!if_silent) fprintf (stdout, "%u ", matrix[i*M_WIDTH + j]);
        }
        if (!if_silent) fprintf (stdout, "\n ");
    }
    if (!if_silent) fprintf (stdout, "\n");
    fwrite (matrix, 1, M_SIZE * sizeof (unsigned int), matrix_fp);
    fclose (matrix_fp);
    free (matrix); matrix = NULL;
    return (1);
}
unsigned int *
ReadMatrixFile (
    const char * const matrix_fn_p,
    const unsigned int M_WIDTH,   // matrix width
    const unsigned int M_HEIGHT,  // matrix height
    const int if_silent           // If not show messages
    )
{
    FILE * matrix_fp = NULL;
    const unsigned int M_SIZE = M_WIDTH * M_HEIGHT;
    unsigned int * matrix = NULL;
    unsigned int i = 0, j = 0;
    matrix_fp = OpenFile(matrix_fn_p, "rb", if_silent);
    matrix = (unsigned int *) malloc(M_SIZE * sizeof (unsigned int));
    fread(matrix, 1, M_SIZE * sizeof (unsigned int), matrix_fp);
    if (!if_silent) {
        fprintf (stdout, "Read contents of matrix:\n");
        fprintf (stdout, " ");
        for (i = 0; i < M_HEIGHT; i++) {
            for (j = 0; j < M_WIDTH; j++) {
                fprintf (stdout, "%u ", matrix[i*M_WIDTH + j]);
            }
            fprintf (stdout, "\n ");
        }
        fprintf(stdout, "\n");
    }
    fclose (matrix_fp);
    return (matrix);
}
int
WriteMatrixFile (
    const char * const matrix_fn_p,
    const unsigned int * const matrix,
    const unsigned int M_WIDTH,   // matrix width
    const unsigned int M_HEIGHT,  // matrix height
    const int if_silent           // If not show messages
    )
{
    FILE * matrix_fp = NULL;
    const unsigned int M_SIZE = M_WIDTH * M_HEIGHT;
    unsigned int i = 0, j = 0;
    matrix_fp = OpenFile (matrix_fn_p, "wb", if_silent);
    fwrite (matrix, 1, M_SIZE * sizeof (unsigned int), matrix_fp);
    if (!if_silent) {
        fprintf (stdout, "Written contents of matrix:\n");
        for (i = 0; i < M_HEIGHT; i++) {
            for (j = 0; j < M_WIDTH; j++) {
                fprintf (stdout, "%u ", matrix[i*M_WIDTH + j]);
            }
            fprintf (stdout, "\n");
        }
    }
    fclose (matrix_fp);
    return (1);
}
// Usage:
// CompareMatrixFile ("your output", "golden output", WC, HC, 1);
void
CompareMatrixFile (
    const char * const matrix_fn_p1,
    const char * const matrix_fn_p2,
    const unsigned int M_WIDTH,   // matrix width
    const unsigned int M_HEIGHT,  // matrix height
    const int if_silent           // If not show messages
    )
{
    unsigned int i = 0, j = 0, wrong = 0;
    int check_ok = 1;
    unsigned int * m1 = ReadMatrixFile (matrix_fn_p1, M_WIDTH, M_HEIGHT, if_silent);
    unsigned int * m2 = ReadMatrixFile (matrix_fn_p2, M_WIDTH, M_HEIGHT, if_silent);
    printf (" Comparing file %s with %s …\n", matrix_fn_p1, matrix_fn_p2);
    for (i = 0; i < M_HEIGHT && wrong < 15; i++) {
        for (j = 0; j < M_WIDTH && wrong < 15; j++) {
            //printf ("m1[%d][%d] ?= m2[%d][%d] : %d ?= %d\n",
            //    i,j,i,j, m1[i*M_WIDTH+j], m2[i*M_WIDTH+j]);
            if (m1[i*M_WIDTH+j] != m2[i*M_WIDTH+j]) {
                printf ("m1[%d][%d] != m2[%d][%d] : %d != %d\n",
                    i,j,i,j, m1[i*M_WIDTH+j], m2[i*M_WIDTH+j]);
                check_ok = 0; wrong++;
            }
        }
    }
    printf (" Check ok? ");
    if (check_ok) printf ("Passed.\n");
    else printf ("Failed.\n");
}
float
CheckSum(const float *matrix, const int width, const int height)
{
    int i, j;
    float s1, s2;
    for (i = 0, s1 = 0; i < width; i++) {
        for (j = 0, s2 = 0; j < height; j++) {
            s2 += matrix[i * width + j];
        }
        s1 += s2;
    }
    return s1;
}
This is the expected output when you test your implementation of matrix-matrix multiplication.
Input matrix size: 8 by 8
Setup host side environment:
 Allocate host memory for matrices M and N.
 M: 8 × 8
 N: 8 × 8
 Allocate memory for the result on host side.
 Generate input matrix data for matrix M and N.
 1 2 3 4 5 6 7 8
 2 3 4 5 6 7 8 9
 3 4 5 6 7 8 9 10
 4 5 6 7 8 9 10 11
 5 6 7 8 9 10 11 12
 6 7 8 9 10 11 12 13
 7 8 9 10 11 12 13 14
 8 9 10 11 12 13 14 15
 Computing matrix multiplication M x N:
 CPU Processing time : 0.009000 (ms)
 Matrix data checksum : 35456
 Matrix data contents :
 204 240 276 312 348 384 420 456
 240 284 328 372 416 460 504 548
 276 328 380 432 484 536 588 640
 312 372 432 492 552 612 672 732
 348 416 484 552 620 688 756 824
 384 460 536 612 688 764 840 916
 420 504 588 672 756 840 924 1008
 456 548 640 732 824 916 1008 1100
As we discussed in Chapters 6-10, maximizing kernel performance on a particular GPU requires knowledge of the resource limitations of the GPU hardware. Therefore, the main hardware resource provisions of each GPU are typically exposed to applications through a standardized system called compute capability. The general specifications and features of a compute device depend on its compute capability. For CUDA, compute capability starts at Compute 1.0, and at the time of this writing the latest version is Compute 3.5. Each higher level of compute capability indicates a newer generation of GPU devices with a larger set of features. Table B.1 highlights the key differences in feature support between the compute capabilities. Features not listed can be considered supported by all compute capability variations; differences in memory coalescing are discussed in Section B.2. In general, a higher-level compute capability defines a superset of the features of a lower level.
Table B.1 Key Functional Support Variations Between CUDA Compute Capabilities
Table B.2 shows the main dimensions of the compute capability specifications and gives the numerical value of each dimension for Compute 3.5. Each higher level of compute capability enhances one or more of these dimensions.
Table B.2 Main Dimensions of Compute Capability and the Attributes of Compute 3.5
| Feature | Compute 3.5 |
| Streaming processors per multiprocessor (MP) | 192 |
| Max. number of threads per block | 1,024 |
| Max. grid dimensions X, Y, Z | 2^31 - 1, 65,535, 65,535 |
| Max. block dimensions X, Y, Z | 1,024, 1,024, 64 |
| Threads in a warp | 32 |
| Registers per MP | 65,536 (64 K) |
| Shared memory per MP | 49,152 (48 K) |
| Banks in shared memory | 32 |
| Total constant memory | 65,536 (64 K) |
| Cache working set per MP for constants | 8,192 (8 K) |
| Local memory per thread | 524,288 (512 K) |
| Cache working set per MP for textures | 6-8 KB |
| Max. active blocks per MP | 16 |
| Max. active warps per MP | 64 |
| Max. active threads per MP | 2,048 |
| Max. width of a 1D texture bound to a CUDA array | 65,536 |
| Max. width of a 1D texture bound to linear memory | 2^27 |
| Max. dimensions X, Y of a 2D texture bound to linear memory or a CUDA array | 65,000 and 65,536, respectively |
| Max. dimensions X, Y, Z of a 3D texture bound to a CUDA array | 4 K × 4 K × 4 K |
| Max. width, height, and layers of a cubemap layered texture reference | 16,384 × 16,384 × 2,046 |
| Max. number of textures that can be bound to a kernel | 256 |
| Max. width of a 1D surface reference bound to a CUDA array | 65,536 |
| Max. width and layers of a 1D layered surface reference | 65,536 × 2,048 |
| Max. width, height, and layers of a 2D layered surface reference | 65,536 × 32,768 × 2,048 |
| Max. width, height, and depth of a 3D layered surface reference bound to a CUDA array | 65,536 × 32,768 × 2,048 |
| Max. width, height, and layers of a cubemap layered surface reference | 32,768 × 32,768 × 2,046 |
| Max. number of surfaces | |
| Max. instructions per kernel | 512 million microcode instructions |
Depending on the time of its introduction, each CUDA-enabled device supports up to a particular generation of compute capability. Many CUDA-enabled devices are introduced each year. Readers should refer to http://developer.nvidia.com/cuda-gpus for an updated list.
Many device-specific features and sizes can be determined by calling the CUDA runtime function cudaGetDeviceProperties(). See the CUDA Programmer Guide for more details.
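As a minimal sketch, assuming a single CUDA-capable device at index 0, the following host program uses cudaGetDeviceProperties() to print a few of the limits listed in Table B.2 (the particular fields printed are an illustrative selection, not a complete list):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    // Query the properties of device 0; error handling is kept minimal in this sketch.
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("No CUDA device found.\n");
        return -1;
    }
    printf("Device name             : %s\n", prop.name);
    printf("Compute capability      : %d.%d\n", prop.major, prop.minor);
    printf("Max threads per block   : %d\n", prop.maxThreadsPerBlock);
    printf("Max grid size (X, Y, Z) : %d, %d, %d\n",
        prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Warp size               : %d\n", prop.warpSize);
    printf("Registers per block     : %d\n", prop.regsPerBlock);
    printf("Shared memory per block : %zu bytes\n", prop.sharedMemPerBlock);
    printf("Total constant memory   : %zu bytes\n", prop.totalConstMem);
    return 0;
}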
Each level of compute capability also specifies a different level of hardware memory coalescing capability. Knowing the compute capability, one can determine the number of global memory transactions that a load instruction in a warp will incur. Later compute capabilities, such as 2.x and higher, substantially reduce the number of memory transactions and the occurrence of noncoalesced accesses. In Compute 1.0 and Compute 1.1, memory transactions are issued for either 64 B or 128 B memory segments. Coalescing of accesses in a warp requires that the kth thread in the warp access the kth word in a 64 B segment when accessing 32-bit words (or the kth word in two contiguous 128 B segments when accessing 128-bit words). Not all threads need to participate for coalescing to work. At the top of Figure B.1, one of the threads in a warp does not participate, and the accesses are still coalesced into one transaction.
Figure B.1 Memory coalescing in Compute 1.0 and Compute 1.1.
In particular, all accesses must be in sequence. If one or more of the accesses are out of sequence, the accesses will no longer be coalesced. In the middle of Figure B.1, two of the accesses are out of sequence. The accesses are therefore not coalesced; 16 transactions to global memory are performed for the access.
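To make the rule concrete, the following illustrative kernel sketch (the kernel and array names are hypothetical, not taken from the text) contrasts the two patterns: consecutive threads reading consecutive words satisfy the in-sequence requirement, whereas a stride-32 pattern does not and, on Compute 1.0 and Compute 1.1, would be serviced by separate transactions.

__global__ void coalescingExample(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Coalesced: thread k of the warp reads word k of an aligned segment.
        float a = data[i];
        // Not coalesced on Compute 1.0/1.1: threads of a warp read words that are
        // 32 elements apart, so they do not touch consecutive words of one segment.
        float b = data[(i * 32) % n];
        data[i] = a + b;
    }
}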
In Compute 1.2 and higher, the global memory transactions are issued in 32 B, 64 B, or 128 B segments. Having a smaller segment size allows the hardware to reduce waste of global memory bandwidth for some less coherent warp access patterns.
Figure B.2 illustrates improvements in memory coalescing in Compute 1.2. The top part shows that warp accesses within a segment can be out of sequence and still be fully coalesced into one transaction.
Figure B.2 Memory coalescing in Compute 1.2 and higher.
The middle part shows that the access can be nonaligned across a 128 B boundary. One extra 32 B segment transaction will be issued, and the accesses are still coalesced. The bottom part shows that if warp accesses are nonaligned but stay within a 128 B boundary, a single 128 B segment transaction will be used to access all the words involved. In both cases, the global memory bandwidth consumption is much less than in Compute 1.0 or Compute 1.1, where 16 transactions of 64 B segments would be used.
Figure B.3 illustrates the improvements introduced in Compute 2.0, which allow all aligned memory accesses to be considered coalesced and eliminate the additional memory transactions.
Figure B.3 Examples of global memory access and resulting memory transactions for each compute capability.